I have a table essentially equivalent to this example:
create table my_table(i integer, x text, y integer, z integer);
create index idx_i_x on my_table(i, x);
create index idx_i_y on my_table(i, y);
create index idx_i_z on my_table(i, z);
create index idx_i_xyz on my_table(i, x, y, z);
All queries have column i
in the where clause.
It seems that this table is a good candidate to benefit from
table clustering (https://www.postgresql.org/docs/9.6/static/sql-cluster.html)
Stats on my_table: for each value of i, there will be from 5 to 10 million rows.
Updates on the table follow the pattern:
update my_table set i = 4 where i = 0;
delete from my_table where i = 6;
All updates or deletes are "bulk", i.e. they
involve all rows for a given value of i.
Given what is written in the docs (https://www.postgresql.org/docs/9.6/static/sql-cluster.html):

"Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered."
it seems as if it doesn't matter that rows are not reclustered after updates of i, since when I execute this update (for example):
update my_table set i = 4 where i = 0;
all rows that had i = 0
are still clustered together.
Best Answer
CLUSTER helps read performance where actual table rows have to be read (from the main relation). Index-only scans don't care about the physical order of rows in the underlying table. If your table layout is that simple, you might cover almost all SELECT queries with index-only scans. (Might require aggressive autovacuum settings for the table to support that.)

UPDATE performance may even suffer from CLUSTER. First of all, CLUSTER rewrites the table without dead tuples, thereby removing "wiggle room" for H.O.T. updates. A lower FILLFACTOR might help, but not in your case: while updating indexed columns, H.O.T. updates are not possible anyway.

While you update millions of rows in one command (or transaction), clustered together, new row versions will hardly ever find room on the same data page, which makes the update a bit more expensive to begin with. And since you have multiple indexes including the updated columns, your updates are particularly expensive: all of these indexes need to be updated as well.

Dead rows have to be reclaimed eventually, so new rows are not always written "en bloc", which inevitably leads to (slow, in your case) fragmentation over time.
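For reference, the per-table knobs mentioned above are set like this (a sketch; idx_i_x is one of the indexes from the question, and the numbers are illustrative, not recommendations):

```sql
-- One-time physical reorder by the (i, x) index; not maintained afterwards.
CLUSTER my_table USING idx_i_x;

-- Leave free space on each page for new row versions (default is 100).
ALTER TABLE my_table SET (fillfactor = 90);

-- More aggressive autovacuum for this table, to keep the visibility map
-- current enough for index-only scans.
ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0.01);
```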
More efficient approach?
Seems like you should not be updating millions of rows to begin with. Instead, create a second table (1:n) with a single row for every distinct i in my_table.
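One possible shape of that table (the name current_i is from the text; the column name cur_i and the backfill query are assumptions for this sketch): my_table.i becomes a stable key, and only cur_i here is ever updated.

```sql
-- One row per distinct value of i in my_table.
create table current_i (
    i     integer primary key,  -- stable key, matched by my_table.i
    cur_i integer not null      -- the value that used to be updated in bulk
);

-- Backfill from the existing data.
insert into current_i (i, cur_i)
select distinct i, i from my_table;
```

The bulk update from the question then becomes a single-row update, e.g. `update current_i set cur_i = 4 where cur_i = 0;`.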
It should be a very small table, given your specifications. Now you only update a single row in current_i instead of millions in my_table, saving a lot of bloat and vacuuming in my_table and its indexes - in addition to the much faster updates - which should make everything else faster, too.

To enforce referential integrity, you may want to add a FOREIGN KEY constraint to my_table.i.
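The constraint might look like this (a sketch; the constraint name is an assumption, and it presumes current_i keyed on i as described in the text):

```sql
alter table my_table
    add constraint my_table_i_fkey
    foreign key (i) references current_i (i);
```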
In queries, just join to current_i. You might provide a VIEW as a drop-in replacement for the current table in SELECT queries.

Should be much faster overall. All of this may come down to a simple case of normalization.
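Such a drop-in view might look like this (a sketch; the view name and the assumption that current_i(i, cur_i) holds the current value under cur_i are mine, not from the answer):

```sql
-- Exposes the current value under the old column name i,
-- so existing SELECT queries can switch with only a rename.
create view my_table_v as
select c.cur_i as i, m.x, m.y, m.z
from my_table m
join current_i c using (i);
```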