I have a table that will contain some pre-computed data based on other tables. (Computing on the fly is too computationally expensive given the size of data I have to deal with.) I will be generating it incrementally as source data is added. (I won't ever need to UPDATE it in normal usage; portions may be deleted and regenerated.) The table will be fairly large: it's currently about 50 million rows and will grow every year.
Most of the queries against this table will be filtered down by a foreign key ID column. As such, they perform better if all the rows for a given ID are grouped into the same pages. I can guarantee this ordering on disk by creating an index and calling CLUSTER periodically, but this is obviously less than ideal, as it would require some kind of scheduled task, coordination against usage and other scheduled tasks, etc.
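For reference, the periodic approach I'd like to avoid would look roughly like this (the index name is illustrative):

```sql
-- Index on the foreign key column; CLUSTER rewrites the table to match it.
CREATE INDEX big_table_fk_idx ON big_table (source_table1_id);

-- Full table rewrite in index order, holding an ACCESS EXCLUSIVE lock.
CLUSTER big_table USING big_table_fk_idx;

-- Later runs (no USING clause) reuse the index recorded by the first call.
CLUSTER big_table;
```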
However, since I'm generating this data in chunks related to the foreign key I want to CLUSTER by, I can easily put an ORDER BY clause on the INSERT command:
INSERT INTO big_table (source_table1_id,a,b,c)
SELECT
source_table1_id,
5 /* some formula */,
/* ... */
FROM source_table1
JOIN source_table2 ON ...
...
WHERE ... /* some condition indicating what needs to be generated */
ORDER BY source_table1_id
Will this affect the on disk storage order, grouping the rows into close to the minimum number of pages? And if it does, are there other processes that could mess up the on disk order later?
I am currently using PostgreSQL 9.3, but I'd also like to know how newer versions behave, with upgrades in mind.
Best Answer
Postgres physically writes tuples sequentially, in INSERT order. If you do that to a new table, or to a table without dead tuples, you get the same result you would achieve with CLUSTER on an index with the same sort order as your INSERT. The effect of CLUSTER deteriorates with later writes to the table in just the same way (and it stays intact as long as you never DELETE or UPDATE - or INSERT rows breaking the desired order). Some answers focus on those effects from later writes and miss the point of the question. The answer to your question is basically:
YES, inserting in order does have the same effect as clustering.
based on some conditions:

"Most of the queries against this table will be filtered down by a foreign key ID column."

Meaning you access all rows with the same ID at once, not a range of sequential IDs. Then all you need is tuples clustered per ID; the physical order between different IDs is meaningless and irrelevant.

And:

"I'm generating this data in chunks related to the foreign key I want to CLUSTER by."

Meaning, "chunks" include all rows for the same ID in sequence; no other rows for the same ID are inserted sooner or later. So something like the INSERT ... ORDER BY source_table1_id in your question fits the bill.

And:

"Portions may be deleted and regenerated."
The part about DELETE is the only mildly problematic part. If you never deleted either, you would be done here. If by "portions" you mean all rows for a given ID at once, you are still mostly good. As long as you delete and re-insert within the same transaction, there is no fragmentation within IDs. (Deleted tuples are not "dead" yet and are not overwritten in the same transaction.) Dead tuples start bloating the table, and later inserts can start filling physical holes, which is where fragmentation can begin. Bloat from dead tuples has various accumulating bad effects, but index access to all rows for a given ID is mostly unaffected.
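If you want to verify physical placement or watch for incipient bloat, a couple of diagnostic queries might look like this (table and column names follow the question's example; the ID value is a placeholder):

```sql
-- ctid is (page, tuple): rows for one ID should span few distinct pages.
SELECT source_table1_id, ctid
FROM   big_table
WHERE  source_table1_id = 42   -- some ID of interest
ORDER  BY ctid;

-- Dead-tuple counts as tracked by the statistics collector.
SELECT relname, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'big_table';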
But all of that is orthogonal to your question, since the same considerations apply after CLUSTER.

Have you considered pg_repack, which can do the same as CLUSTER, just without an exclusive lock on the table? They just added Postgres 9.6 to the list of supported versions this week.
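An invocation for this case might look like the following (the database name is a placeholder; check the options of your installed version):

```shell
# Rewrite big_table in source_table1_id order, like CLUSTER,
# but holding only brief exclusive locks during the swap.
pg_repack --dbname=mydb --table=big_table --order-by=source_table1_id
```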