PostgreSQL – Clustering Rows Without Exclusive Lock and Logging Overhead

clusteringpostgresqlpostgresql-9.3

The CLUSTER command on a big table can take a long time, and blocks both reads and writes to the table while it runs.

I don't need the data in my table to be strictly sorted in index order, I just want rows that are commonly queried together to be more likely to be in the same database blocks than scattered evenly through the table (which is the distribution they naturally have due to the nature of the way date is inserted to the table).

It can make a big difference. In the example below, the only difference is that the insert has an additional order by mod(g,10) so that the test data is pre-clustered by host_id. Far fewer blocks need to be read when getting all the data for one particular host_id.

Is there some way of achieving this kind of clustering without the exclusive lock and logging overhead of the cluster command?

create schema stack;
set search_path=stack;
--
create table foo(host_id integer, bar text default repeat('a',400));
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g;
create index nu_foo on foo(host_id);
explain analyze select count(bar) from foo where host_id=1;
/*
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=30188.66..30188.67 rows=1 width=404) (actual time=1129.858..1129.859 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=919.27..30066.46 rows=48883 width=404) (actual time=253.149..1110.013 rows=50000 loops=1)
         Recheck Cond: (host_id = 1)
         Rows Removed by Index Recheck: 320257
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..907.04 rows=48883 width=0) (actual time=251.863..251.863 rows=50000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 1129.893 ms
*/
--
drop table foo;
--
create table foo(host_id integer, bar text default repeat('a',400));
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g order by mod(g,10);
create index nu_foo on foo(host_id);
explain analyze select count(bar) from foo where host_id=1;
/*
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7550.20..7550.21 rows=1 width=32) (actual time=24.397..24.397 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=47.80..7543.95 rows=2500 width=32) (actual time=3.988..16.189 rows=50000 loops=1)
         Recheck Cond: (host_id = 1)
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..47.17 rows=2500 width=0) (actual time=3.649..3.649 rows=50000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 24.437 ms
*/
--
drop schema stack cascade;

Best Answer

You can do this without using the cluster command and having the table locked or generating WAL for the whole table. The cost is that you need to full-scan the table regularly.

The basic idea is:

  1. turn off autovacuum for the table
  2. check each block to determine the degree of clustering
  3. delete and re-insert all the rows from blocks below a clustering threshold
  4. manually vacuum to free those (complete) blocks
  5. repeat steps 2-4 as regularly as necessary

test schema sample data initially 'part-clustered':

create schema stack;
set search_path=stack;
create type t_tid as (blkno bigint, rowno integer);
create table foo(host_id integer, bar text default repeat('a',400)) with (autovacuum_enabled=false);
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g order by mod(g,10);
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g;
create index nu_foo on foo(host_id);

initial clustering statistics:

select cn, count(*)
from ( select count(*) cn
       from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
       group by blkno ) z
group by cn
order by cn;
/*
 cn | count
----+-------
  1 | 27769  <---- half clustered
  2 |     8
  5 |     1
 10 | 27778  <---- half un-clustered
*/
select count(distinct (ctid::text::t_tid).blkno) from foo where host_id=1;
/*
 count
-------
 30558  <--------- lots of blocks to read for `host_id=1`
*/

initial analyze (2146.503 ms):

explain analyze select count(bar) from foo where host_id=1;
/*
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15097.30..15097.31 rows=1 width=32) (actual time=2146.157..2146.158 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=95.17..15084.80 rows=5000 width=32) (actual time=21.586..2092.379 rows=100000 loops=1)
         Recheck Cond: (host_id = 1)
         Rows Removed by Index Recheck: 286610
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..93.92 rows=5000 width=0) (actual time=19.232..19.232 rows=100000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 2146.503 ms
*/

delete and re-insert the un-clustered rows:

with w as ( select blkno
            from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
            group by blkno
            having count(*)>2 )
   , d as ( delete from foo
            where (ctid::text::t_tid).blkno in (select blkno from w)
            returning * )
insert into foo(host_id,bar) select host_id,bar from d order by host_id;
--
vacuum foo;

new clustering statistics:

select cn, count(*)
from ( select count(*) cn
       from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
       group by blkno ) z
group by cn
order by cn;
/*
 cn | count
----+-------
  1 | 55541  <---- fully clustered
  2 |    16
*/
select count(distinct (ctid::text::t_tid).blkno) from foo where host_id=1;
/*
 count
-------
  5558  <--------- far fewer blocks to read for `host_id=1`
*/

new analyze (48.804 ms):

explain analyze select count(bar) from foo where host_id=1;
/*
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16110.64..16110.65 rows=1 width=32) (actual time=48.760..48.761 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=131.18..16098.14 rows=5000 width=32) (actual time=8.402..32.439 rows=100000 loops=1)
         Recheck Cond: (host_id = 1)
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..129.93 rows=5000 width=0) (actual time=7.636..7.636 rows=100000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 48.804 ms
*/

clean up:

drop schema stack cascade;

The above is workable now, but is a bit quirky (needing to turn off auto-vacuum for the table) and requires regular full-scanning the table. I think something similar without the disadvantages could built into postgres. You'd need:

  1. A space efficient index to cluster on (this is coming in 9.4 with GIN compression, or better still in 9.5 with the new BRIN index type)
  2. A 'vacuum-like' process that would scan that index to detect which blocks need to be deleted/reinserted (this would ideally be able to reinsert the rows into fresh blocks so auto-vacuum can be left at default)