PostgreSQL – Impact of CLUSTER on Performance

index-tuningperformancepostgresqlpostgresql-9.2postgresql-performancestorage

I'm trying to optimize my Postgres 9.2 database to speed up queries with date restrictions.

I have a timestamp column, but mostly I'm asking for some day, so I have created an index with timestamp to date parsing:

CREATE INDEX foo_my_timestamp_idx
ON foo
USING btree
((my_timestamp::date) DESC);

Now, to increase a performance I CLUSTER foo table using index above:

CLUSTER foo USING foo_my_timestamp_idx;

According to the manual on SQL-CLUSTER, the table

is physically reordered based on the index information

I wonder whether there is an impact on performance for other queries using a PK of table (let say id_foo). Are there any downsides?

Best Answer

Yes, there can be downsides. If another query looks at a different data segment not determined by the date, it might take a performance hit if rows are spread out over more data pages now. Just the same way as your first query profits. That completely depends on information not in your question.

other queries using a PK of table (let say id_foo)

That could be anything. It depends on what you have and what you query exactly. Querying a single row is not affected either way, but multiple rows might be.

Be aware that CLUSTER rewrites the table in pristine condition like VACUUM FULL does (removes dead tuples, compacts the physical size of the table, rewrites indexes) So you might see an immediate positive effect on read performance independent of the sort order. (Much like you would get with VACUUM FULL.)
After CLUSTER, you may want to run a plain VACUUM on the table to update the visibility map, too - which may allow index-only scans.

All benefits of CLUSTER shrink with the write frequency.

Also, if you have many updates to the table, CLUSTER can actually hurt write performance by removing "wiggle room" for HOT updates on the same data page. You might be able to counter that effect with a FILLFACTOR setting below 100. Again, depends on locality of updated rows, etc.

Related:

Either way, I would probably not index and cluster on my_timestamp::date, but on my_timestamp directly. Nothing lost, something gained. The cast is very cheap, but it's still cheaper not to cast at all. And the index can support more queries.

CREATE INDEX foo_my_timestamp_idx ON foo (my_timestamp);

Even though a date occupies only 4 bytes on disk and a timestamp occupies 8 bytes, the difference is typically lost to alignment padding for your case, and both indexes have exactly the same size.

The order of multiple rows on the same day resulting from your expression index is arbitrary. There can still be two identical timestamps, but with 6 fractional digits that's normally very unlikely. Aside from that you get a deterministic order of rows, which can have various advantages.

I also dropped the DESC key word since Postgres can read indexes backwards virtually as fast a forwards. (Sort order matters for multicolumn indexes, though!) More:

Instead of:

SELECT * FROM foo
WHERE my_timestamp::date = '2016-07-25';

You would now use:

SELECT * FROM foo
WHERE  my_timestamp >= '2016-07-25'  -- this is a timestamp literal now
WHERE  my_timestamp <  '2016-07-26';

Same performance.

If you don't need the time component of the column at all, convert the column to date ...

How to roll back CLUSTER?

CLUSTER on a single table can be rolled back with ROLLBACK like any other regular command as long as the transaction has not been committed.

However, I quote the manual:

CLUSTER without any parameter reclusters all the previously-clustered tables in the current database that the calling user owns, or all such tables if called by a superuser. This form of CLUSTER cannot be executed inside a transaction block.

You can always run CLUSTER with a different index to change the physical order of rows once more.