PostgreSQL – Does Inserting in Order Have Same Effect as Clustering?

performancepostgresql

I have a table that will contain some pre-computed data based off other tables. (Computing on the fly is too computationally expensive given the size of data I have to deal with.) I will be generating incrementally as source data is added. (I won't ever need to UPDATE it in normal usage; portions may be deleted and regenerated.) The table will be fairly large. It's currently about 50 million rows and will grow every year.

Most of the queries against this table will be filtered down by a foreign key ID column. As such, they perform better if all the rows for that ID are grouped into the same pages. I can guarantee this ordering on disk by creating an index and calling CLUSTER periodically, but this is obviously less than ideal as it would require some kind of scheduled task, coordinating against usage and other scheduled tasks, etc.

However, since I'm generating this data in chunks related to the foreign key I want to CLUSTER by, I can easily put an ORDER BY clause on the INSERT command:

INSERT INTO big_table (source_table1_id,a,b,c)
SELECT
   source_table1_id,
   5 /* some formula */,
   /* ... */
FROM source_table1
JOIN source_table2 ON ...
...
WHERE ... /* some condition indicating what needs to be generated */
ORDER BY source_table1_id

Will this affect the on disk storage order, grouping the rows into close to the minimum number of pages? And if it does, are there other processes that could mess up the on disk order later?

I am currently using PostgreSQL 9.3, but I'd like to know about newer versions as well for upgrades.

Best Answer

Postgres physically writes tuples sequentially as INSERTed. If you do that to a new table or a table without dead tuples, you get just the same result you would achieve with CLUSTER on an index with the same sort order as your INSERT. The effect of CLUSTER deteriorates with later writes to the table in just the same way (and stays intact while you never DELETE or UPDATE - or INSERT breaking the desired order).

Some answers focus on those effects from later writes and miss the point of the question. The answer to your question is basically:

YES, inserting in order does have the same effect as clustering.

based on some conditions:

queries against this table will be filtered down by a foreign key ID column.

Meaning you access row with the same ID at once, not a range of sequential IDs. Then all you need is tuples clustered per ID, the physical order between IDs is meaningless and irrelevant.

And:

generating this data in chunks related to the foreign key

Meaning, "chunks" include all rows for the same ID in sequence. There are no other rows for the same ID inserted sooner or later. So something like:

INSERT INTO big_table (source_table1_id,a,b,c)
SELECT s1.source_table1_id, ... 
FROM   source_table1 s1
...
WHERE s1.source_table1_id BETWEEN 123 and 125 -- example
ORDER BY s1.source_table1_id

And:

I won't ever need to UPDATE it in normal usage; portions may be deleted and regenerated.

The part about DELETE is the only mildly problematic part. If you would never delete either, you would be done here. If by "portions" you mean all rows with said ID at once, you are still good, mostly. While deleting and inserting in the same transaction, there is no fragmentation within IDs. (Deleted tuples are not "dead" yet and not overwritten in the same transaction.)

Dead tuples start bloating the table, and later inserts can start filling physical holes, which is where fragmentation can start. The bloat with dead tuples has various accumulating bad effects, but index access on all rows for a given ID is mostly unaffected.

But all that is orthogonal to your question, since the same considerations apply to CLUSTER.

Have you considered pg_repack, which can do the same as CLUSTER, just without exclusive lock on the table. They just added Postgres 9.6 to the list of supported versions this week.

Related: