PostgreSQL – Is clustering a table beneficial, given this update pattern

database-design, performance, postgresql, postgresql-performance, update

I have a table essentially equivalent to this example:

create table my_table(i integer, x text, y integer, z integer);

create index idx_i_x on my_table(i, x);
create index idx_i_y on my_table(i, y);
create index idx_i_z on my_table(i, z);
create index idx_i_xyz on my_table(i, x, y, z);

All queries have column i in the where clause.

It seems that this table is a good candidate to benefit from
table clustering (https://www.postgresql.org/docs/9.6/static/sql-cluster.html)
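
For reference, the clustering itself would be a one-off command along these lines (idx_i_x is just one candidate; any index leading with i could be used):

cluster my_table using idx_i_x;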

Stats on my_table: for each value of i, there will be from 5 to 10 million rows.

Updates on the table follow the pattern:

update my_table set i = 4 where i = 0;

delete from my_table where i = 6;

All updates or deletes are "bulk", i.e. they
involve all rows for a given value of i.

Given what is written in the docs (https://www.postgresql.org/docs/9.6/static/sql-cluster.html):

Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered.

it seems as if it doesn't matter that rows are not reclustered after updates of i, since when I execute this update (for example):

update my_table set i = 4 where i = 0;

all rows that had i = 0 are still clustered together.

Best Answer

CLUSTER helps read performance where actual table rows have to be read (from the main relation). Index-only scans don't care about the physical order of rows in the underlying table. If your table layout is that simple, you might cover almost all SELECT queries with index-only scans, which might require aggressive autovacuum settings for the table to keep the visibility map up to date.
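
As a rough sketch of what "aggressive autovacuum settings" could look like at the table level (the values here are placeholders, not tuned recommendations):

ALTER TABLE my_table SET (
   autovacuum_vacuum_scale_factor  = 0.01
 , autovacuum_analyze_scale_factor = 0.01
);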

UPDATE performance may even suffer from CLUSTER. First of all, CLUSTER rewrites the table without dead tuples, thereby removing the "wiggle room" for H.O.T. updates. A lower FILLFACTOR might help with that, but not in your case: since you are updating indexed columns, H.O.T. updates are not possible anyway.
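
Just for completeness, a per-table FILLFACTOR would be set like this (shown only to illustrate the knob; as said above, it would not help here, and it only affects pages written after the setting, or after a rewrite with CLUSTER / VACUUM FULL):

ALTER TABLE my_table SET (fillfactor = 90);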

When you update millions of rows clustered together in one command (or transaction), the new row versions will hardly ever find room on the same data page, which makes the update a bit more expensive to begin with.

And since you have multiple indexes that include the updated column, your updates are particularly expensive: all of these indexes need to be updated as well.

Dead rows have to be reclaimed eventually, so new rows are not always written "en bloc", which inevitably leads to fragmentation over time (slowly, in your case).

More efficient approach?

for each value of i, there will be from 5 to 10 million rows.
...
All updates or deletes are "bulk", i.e. they involve all rows for a given value of i.

Seems like you should not be updating millions of rows to begin with. Instead, create a second table (1:n) with a single row for every distinct i in my_table:

CREATE TABLE current_i (
   org_i     integer PRIMARY KEY
 , current_i integer NOT NULL  -- UNIQUE?
);

Should be a very small table, given your specifications. Now you only update a single row in current_i instead of millions in my_table, saving a lot of bloat and vacuuming in my_table and its indexes - in addition to the much faster updates - which should make everything else faster, too.
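
With that, the bulk update from the question shrinks to a single-row update against current_i (a sketch, assuming the WHERE clause refers to the currently exposed value of i):

UPDATE current_i
SET    current_i = 4
WHERE  current_i = 0;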

To enforce referential integrity, you may want to add a FOREIGN KEY constraint to my_table.i:

REFERENCES current_i(org_i)
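
Spelled out as a complete statement, that might look like the following (the constraint name is just an example):

ALTER TABLE my_table
   ADD CONSTRAINT my_table_i_fkey
   FOREIGN KEY (i) REFERENCES current_i (org_i);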

In queries, just join to current_i. You might provide a VIEW as a drop-in replacement for the current table in SELECT queries:

CREATE VIEW my_view AS
SELECT i.current_i AS i, m.x, m.y, m.z
FROM   my_table  m
JOIN   current_i i ON i.org_i = m.i;
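
Existing SELECT queries could then run against the view unchanged, for example (a hypothetical query):

SELECT x, y, z
FROM   my_view
WHERE  i = 4;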

Should be much faster overall. All of this may come down to a simple case of normalization.