PostgreSQL – Does Inserting in Order Have Same Effect as Clustering?

performancepostgresql

I have a table that will contain some pre-computed data based off other tables. (Computing on the fly is too computationally expensive given the size of data I have to deal with.) I will be generating incrementally as source data is added. (I won't ever need to UPDATE it in normal usage; portions may be deleted and regenerated.) The table will be fairly large. It's currently about 50 million rows and will grow every year.

Most of the queries against this table will be filtered down by a foreign key ID column. As such, they perform better if all the rows for that ID are grouped into the same pages. I can guarantee this ordering on disk by creating an index and calling CLUSTER periodically, but this is obviously less than ideal as it would require some kind of scheduled task, coordinating against usage and other scheduled tasks, etc.

However, since I'm generating this data in chunks related to the foreign key I want to CLUSTER by, I can easily put an ORDER BY clause on the INSERT command:

INSERT INTO big_table (source_table1_id,a,b,c)
SELECT
   source_table1_id,
   5 /* some formula */,
   /* ... */
FROM source_table1
JOIN source_table2 ON ...
...
WHERE ... /* some condition indicating what needs to be generated */
ORDER BY source_table1_id

Will this affect the on disk storage order, grouping the rows into close to the minimum number of pages? And if it does, are there other processes that could mess up the on disk order later?

I am currently using PostgreSQL 9.3, but I'd like to know about newer versions as well for upgrades.

Best Answer

Postgres physically writes tuples sequentially as INSERTed. If you do that to a new table or a table without dead tuples, you get just the same result you would achieve with CLUSTER on an index with the same sort order as your INSERT. The effect of CLUSTER deteriorates with later writes to the table in just the same way (and stays intact while you never DELETE or UPDATE - or INSERT breaking the desired order).

Some answers focus on those effects from later writes and miss the point of the question. The answer to your question is basically:

YES, inserting in order does have the same effect as clustering.

based on some conditions:

queries against this table will be filtered down by a foreign key ID column.

Meaning you access row with the same ID at once, not a range of sequential IDs. Then all you need is tuples clustered per ID, the physical order between IDs is meaningless and irrelevant.

And:

generating this data in chunks related to the foreign key

Meaning, "chunks" include all rows for the same ID in sequence. There are no other rows for the same ID inserted sooner or later. So something like:

INSERT INTO big_table (source_table1_id,a,b,c)
SELECT s1.source_table1_id, ... 
FROM   source_table1 s1
...
WHERE s1.source_table1_id BETWEEN 123 and 125 -- example
ORDER BY s1.source_table1_id

And:

I won't ever need to UPDATE it in normal usage; portions may be deleted and regenerated.

The part about DELETE is the only mildly problematic part. If you would never delete either, you would be done here. If by "portions" you mean all rows with said ID at once, you are still good, mostly. While deleting and inserting in the same transaction, there is no fragmentation within IDs. (Deleted tuples are not "dead" yet and not overwritten in the same transaction.)

Dead tuples start bloating the table, and later inserts can start filling physical holes, which is where fragmentation can start. The bloat with dead tuples has various accumulating bad effects, but index access on all rows for a given ID is mostly unaffected.

But all that is orthogonal to your question, since the same considerations apply to CLUSTER.

Have you considered pg_repack, which can do the same as CLUSTER, just without exclusive lock on the table. They just added Postgres 9.6 to the list of supported versions this week.

MySQL 5.5.8 InnoDB Foreign Key, JOIN performance

Foreign key relationships are to enforce data integrity, not for query performance, that is what indexes are for. Also note that InnoDB creates an index on each column with a foreign key relationship.

However I would recommend having the foreign key relationships to ensure that the data is always valid, especially when updating and deleting, which may become significant when you have to start archiving data.

Postgresql – How un-clustered is a CLUSTER USING table in PostgreSQL

I am not entirely sure how much this information helps, but the system table pg_stats contains a correlation column:

select schemaname,tablename,attname,correlation 
from pg_stats 
where schemaname='public' 
order by correlation;

From the manual

Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is null if the column data type does not have a < operator.)

This value is on a per-column basis, so my guess is, that if the correlation for the first index column is near zero, the table is "un-clustered", if near -1 or 1 it is closer to the order of the index and thus "more clustered". This if course can only work for B-Tree indexes that are defined as ASC (I guess).

Best Answer

Related Solutions

MySQL 5.5.8 InnoDB Foreign Key, JOIN performance

Postgresql – How un-clustered is a CLUSTER USING table in PostgreSQL

Related Question