PostgreSQL – Is it possible to have two copies of a table clustered in different arrangements seamlessly

performance, postgresql, query-performance

I have a table with around 100M rows. It only gets data inserted once a day, but we do a lot of selects. The selects are usually simple but sometimes need to return hundreds of thousands of rows.

It's unique on three columns, node_id, pricedate, and hour, which are integer, timestamp, and integer respectively. It was slow for most queries, but clustering it on (node_id, pricedate) fixed the slowness for most of them. Those queries were of the type:

select * from mytable where node_id in (1,2,3,4)
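For reference, the clustering above was done roughly like this, using the unique key's index (the index name is just a placeholder):

create unique index mytable_node_pricedate_hour_key on mytable (node_id, pricedate, hour);

-- cluster physically rewrites the table once; it needs re-running after the daily loads to keep the order
cluster mytable using mytable_node_pricedate_hour_key;
analyze mytable;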

We still occasionally need to do queries like:

select * from mytable where pricedate>='2016-05-01'

These are still slow because the table is clustered by node_id first. We already have an index on pricedate. The issue is that the users often need enough data that the planner throws out the index and uses a seq scan. Once it's using a seq scan, it benefits greatly from having the data clustered in the way it's being queried. This leads to my problem: some queries benefit from one clustering and other queries from the other.
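The switch to a seq scan is easy to confirm by looking at the plan for the date query:

explain (analyze, buffers)
select * from mytable where pricedate>='2016-05-01';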

It would be nice if there were a way to have two physical copies of the table, where one copy is clustered one way and the other another, but user access makes it appear as though there is only one table and the DB engine ensures the copies stay in sync. Obviously there'd be write penalties in doing this, but that's inconsequential for our usage.

Would something like this be possible?

I'm guessing there isn't a built-in way to do what I describe. To do it anyway, I guess I'd make a table called mytable_dup with the same unique key constraint but with the alternate clustering, and then set up triggers to insert into it whenever the master is inserted/updated/deleted. That seems doable, but from there, would there be a reasonable way to select from the duplicated table that will be efficient?
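Roughly what I have in mind (all names below are placeholders, and only the insert path is sketched; updates and deletes would need matching triggers):

-- duplicate table with the same structure and unique key
create table mytable_dup (like mytable including all);

-- keep the copy in sync on insert
create or replace function mytable_copy_to_dup() returns trigger as $$
begin
    insert into mytable_dup values (new.*);
    return new;
end;
$$ language plpgsql;

create trigger mytable_copy_to_dup_trg
    after insert on mytable
    for each row execute procedure mytable_copy_to_dup();

-- cluster the copy the other way, pricedate first
create index mytable_dup_pricedate_idx on mytable_dup (pricedate, node_id, hour);
cluster mytable_dup using mytable_dup_pricedate_idx;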

I'm running PostgreSQL 9.4 at home and 9.5 on Google.

Best Answer

To have the data held in two different physical sequences, one must store the data twice. This can be achieved by defining a second, covering index. A covering index contains all of the columns required by a query. That way the optimiser has no need to refer to the base table to read further values and is unlikely to revert to a base table scan for the query plan; it performs an index-only scan instead. Because the choice of index is made by the optimiser and not by the programmer, no application code need be altered to take advantage of this during reads. No further objects are required to maintain consistency during writes.
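For the date-driven query in the question, such an index could look like the one below. Only the three columns named in the question are shown; any further columns the queries actually read would have to be appended for the index to truly cover a select *.

create index mytable_pricedate_covering_idx on mytable (pricedate, node_id, hour);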

The columns used in the WHERE clause should be the leading columns of the index. The order of the other columns is unimportant. As and when PostgreSQL supports the INCLUDE syntax, this index can be changed to use it.
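Once that syntax is available (it arrived in PostgreSQL 11), the same index would keep only pricedate as a key column and carry the others as payload:

create index mytable_pricedate_covering_idx on mytable (pricedate) include (node_id, hour);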

The downsides include a) extra disk space to store this data, b) additional latency during writes to maintain the index, c) more system maintenance required for reorgs and such, d) the covering index(es) having to change to match as queries change, and e) correspondingly larger and longer backups and restores.