Postgresql – Caching results of query for part of table that almost never change

cachepostgresql

In our database we have a table with a very large number of rows. Let's call it Table1. The greatest difficulty for us is a COUNT query to this table with some conditions (see example below):

Query1

SELECT
    count (*)
FROM
    Table1
WHERE
    Table1.field1 = 'something'
    AND Table1.field2 = 'something'
    AND Table1.field3 = 'something'
    ..

In our application this query can occure with any number and any combination of conditions so we cannot use any index to somehow speed up this query so a simple sequence scan is being used (actually parallel sequence scan).

The feature of Table1 is as follows: new lines in it change very often and old ones almost never change (lets assume that the lines that often change is lines that created 'today' and the lines that almost never change is lines that created 'not today').

Because of this feature, the following idea emerged: if in our application occurs the need of execution of a Query1 (see above), we can actually not execute Query1, but instead we can execute 2 queries (see below): one for new rows (Query2) and one for old rows (Query3) and further we can summarized the results of these two queries to get total COUNT result. And to speed up overall performance we can perform Caching the results of query for old rows (Query3). Yes, we should clear this cache when some UPDATE occure on old rows, but it happens very rare.

Query2

SELECT
    count (*)
FROM
    Table1
WHERE
    Table1.field1 = 'something'
    AND Table1.field2 = 'something'
    AND Table1.field3 = 'something'
    ..
    AND Table1.creation_date = 'today'

Query3

SELECT
    count (*)
FROM
    Table1
WHERE
    Table1.field1 = 'something'
    AND Table1.field2 = 'something'
    AND Table1.field3 = 'something'
    ..
    AND Table1.creation_date != 'today'

So the question is:

Is this a good idea and it can speed up performance or it is a bad idea and it is better not even consider this approach?

Best Answer

It's best not to use a count(*). Instead, use a count(PrimaryKey), where the Primarykey is whatever your PrimaryKey happens to be.

For example, if the PrimaryKey is EmplID, then use count(EmplID). If you don't have a PrimaryKey, create one. Without a PrimaryKey, the query will always use a table scan (where it looks at every single row to find the data) rather than using an indexed seek. Think of it like a filing cabinet. If all your records are organized alphabetically, you always know where to find customers whose names start with "F". If it's not organized, you have to look through every single piece of paper to find the right customer records.

For each of your WHERE conditions, make sure each field is indexed. Field1 should be indexed, Field2 should be indexed. It doesn't matter if the value changes. As long as the field is indexed and you REBUILD or RECREATE the index often, it should work just fine.

You shouldn't need Query2 and Query3. If you decide to use Query2, make sure the Creation_Date is indexed as well.