Here's what I would do. I hope I understood your question correctly. It's a workaround and perhaps not very elegant:
- create a random field, physically order the table by Col1/Col2 plus that random value, then add an AUTO_INCREMENT field ("ID") as the single field of a primary key, so that IDs within each Col1/Col2 group come out consecutive but randomly shuffled inside the group (a sketch follows):
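A minimal sketch of that setup step, assuming MySQL and a MyISAM table literally named `table` (all names here are placeholders; note that InnoDB keeps rows in clustered-index order and ignores ALTER TABLE ... ORDER BY):
ALTER TABLE `table` ADD COLUMN RND DOUBLE;
UPDATE `table` SET RND = RAND();
-- physically reorder the rows: each Col1/Col2 group becomes contiguous,
-- randomly shuffled within the group (MyISAM honors this, InnoDB does not)
ALTER TABLE `table` ORDER BY Col1, Col2, RND;
-- the new AUTO_INCREMENT values are assigned in physical row order
ALTER TABLE `table` ADD COLUMN ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY;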
- create a temporary table with MIN(ID) for each combination of Col1/Col2 (the first ID per group, which is a random one within the group):
CREATE TEMPORARY TABLE temp_table AS
SELECT
    MIN(ID) AS MIN_ID, Col1, Col2
FROM
    `table`
GROUP BY
    Col1, Col2;
- add a second INT field to be the first_index:
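A one-liner for that, with the same placeholder names as above:
ALTER TABLE `table` ADD COLUMN first_index INT;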
- update the table and set first_index to the difference between the running value of the auto-incremented field and the group's minimum, plus one:
UPDATE
    `table` t, temp_table tmp
SET
    t.first_index = (t.ID - tmp.MIN_ID) + 1
WHERE
    t.Col1 = tmp.Col1 AND t.Col2 = tmp.Col2;
- optionally remove the random and ID fields, and add a new PK over the three fields Col1/Col2/first_index (with first_index auto-incremented) for subsequent inserts, as sketched below
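A sketch of that last step, again assuming MyISAM - per-group AUTO_INCREMENT on a secondary column of a composite key is a MyISAM feature that InnoDB does not support:
ALTER TABLE `table`
    DROP PRIMARY KEY,
    DROP COLUMN ID,
    DROP COLUMN RND,
    MODIFY first_index INT NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (Col1, Col2, first_index);
With that key in place, MyISAM numbers first_index from 1 within each Col1/Col2 group on subsequent inserts.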
Postgres physically writes tuples sequentially as INSERTed. If you do that to a new table or a table without dead tuples, you get just the same result you would achieve with CLUSTER on an index with the same sort order as your INSERT. The effect of CLUSTER deteriorates with later writes to the table in just the same way (and stays intact as long as you never DELETE or UPDATE - or INSERT breaking the desired order).
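A minimal illustration of that equivalence, with hypothetical names (big_table_sorted and the index name are made up for the example):
-- filling a fresh table in the desired order ...
CREATE TABLE big_table_sorted AS
SELECT * FROM big_table ORDER BY source_table1_id;

-- ... leaves the same physical row order behind as rewriting in place:
CREATE INDEX big_table_fk_idx ON big_table (source_table1_id);
CLUSTER big_table USING big_table_fk_idx;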
Some answers focus on those effects of later writes and miss the point of the question. The answer to your question is basically:
YES, inserting in order has the same effect as clustering,
given some conditions:
queries against this table will be filtered down by a foreign key ID column.
Meaning you access rows with the same ID at once, not a range of sequential IDs. Then all you need is tuples clustered per ID; the physical order between different IDs is irrelevant.
And:
generating this data in chunks related to the foreign key
Meaning, "chunks" include all rows for the same ID in sequence. There are no other rows for the same ID inserted sooner or later. So something like:
INSERT INTO big_table (source_table1_id, a, b, c)
SELECT s1.source_table1_id, ...
FROM   source_table1 s1
...
WHERE  s1.source_table1_id BETWEEN 123 AND 125  -- example
ORDER  BY s1.source_table1_id;
And:
I won't ever need to UPDATE it in normal usage; portions may be deleted and regenerated.
The part about DELETE is the only mildly problematic part. If you never deleted anything either, you would be done here. If by "portions" you mean all rows for a given ID at once, you are still good, mostly: while deleting and inserting in the same transaction, there is no fragmentation within IDs. (Deleted tuples are not "dead" yet and are not overwritten in the same transaction.)
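A sketch of that delete-and-regenerate pattern in a single transaction, reusing the hypothetical names from the INSERT above (and assuming the source columns are also called a, b, c):
BEGIN;
DELETE FROM big_table WHERE source_table1_id = 123;
-- re-insert the whole portion for this ID before committing,
-- so the deleted tuples are not yet reused for other rows
INSERT INTO big_table (source_table1_id, a, b, c)
SELECT s1.source_table1_id, s1.a, s1.b, s1.c
FROM   source_table1 s1
WHERE  s1.source_table1_id = 123
ORDER  BY s1.source_table1_id;
COMMIT;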
Dead tuples start bloating the table, and later inserts can start filling physical holes, which is where fragmentation can start. The bloat with dead tuples has various accumulating bad effects, but index access on all rows for a given ID is mostly unaffected.
But all that is orthogonal to your question, since the same considerations apply to CLUSTER.
Have you considered pg_repack? It can do the same as CLUSTER, just without an exclusive lock on the table. They just added Postgres 9.6 to the list of supported versions this week.
Best Answer
Your syntax won't work with PostgreSQL, but I can answer your question for PostgreSQL anyway:
Your query will often work as you intend because the three statements will often use the same execution plan. There is no guarantee for that though.
One possibility where even a sequential scan in PostgreSQL will return a different order is if another sequential scan of the same table is already in progress: then PostgreSQL will start scanning where the other statement is currently scanning, i.e. somewhere in the middle of the table, so that both scans can run somewhat synchronized and reduce the necessary I/O operations.
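That behavior is controlled by the configuration parameter synchronize_seqscans (on by default). As a sketch, assuming a generic table t, you could switch it off per session - though an explicit ORDER BY remains the only real guarantee of a stable order:
-- disable synchronized sequential scans for this session only
SET synchronize_seqscans = off;

-- the only reliable way to get a fixed row order is still an explicit sort
SELECT * FROM t ORDER BY id;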