Playing a bit with pg_buffercache, I could get answers to some of your questions.
- This is quite obvious, but the results for (5) also show that answer is YES
- I have yet to set up a good example for this; for now it is more yes than no :) (See my edit below, the answer is NO.)
- Since it is the planner that decides whether to use an index or not, we can say YES, it decides caching (but it is more complicated than that)
- The exact details of caching could be derived from the source code; I couldn't find too much on this topic, except this one (see the author's answer, too). However, I'm pretty sure that this again is far more complicated than a simple yes or no. (Again, from my edit you can get some idea: since the cache size is limited, those 'sensible' indexes compete for the available space. If there are too many of them, they will evict each other from the cache, so the answer is rather NO.)
- As a simple query with pg_buffercache shows, the answer is a definitive YES. It is worth noting that temporary table data do not get cached here.
EDIT
I've found Jeremiah Peschka's terrific article about table and index storage. With information from there, I could answer (2) as well. I set up a small test, so you can check these yourself.
-- we will need two extensions
CREATE EXTENSION pg_buffercache;
CREATE EXTENSION pageinspect;
-- a very simple test table
CREATE TABLE index_cache_test (
id serial
, blah text
);
-- I am a bit megalomaniac here, but I will use this for other purposes as well
INSERT INTO index_cache_test
SELECT i, i::text || 'a'
FROM generate_series(1, 1000000) a(i);
-- let's create the index to be cached
CREATE INDEX idx_cache_test ON index_cache_test (id);
-- now we can have a look at what is cached
SELECT c.relname,count(*) AS buffers
FROM
pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.relname
ORDER BY 2 DESC LIMIT 10;
relname | buffers
----------------------------------+---------
index_cache_test | 2747
pg_statistic_relid_att_inh_index | 4
pg_operator_oprname_l_r_n_index | 4
... (others are all pg_something, which are not interesting now)
-- this shows that the whole table is cached and our index is not in use yet
-- now we can check which row is where in our index
-- in the ctid column, the first number shows the page, so
-- all rows starting with the same number are stored in the same page
SELECT * FROM bt_page_items('idx_cache_test', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (1,164) | 16 | f | f | 6f 01 00 00 00 00 00 00
2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
4 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
5 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00
6 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00
...
64 | (0,63) | 16 | f | f | 3f 00 00 00 00 00 00 00
65 | (0,64) | 16 | f | f | 40 00 00 00 00 00 00 00
-- with the information obtained, we can write a query which is supposed to
-- touch only a single page of the index
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM index_cache_test
WHERE id BETWEEN 10 AND 20 ORDER BY id
;
Index Scan using idx_cache_test on index_cache_test (cost=0.00..8.54 rows=9 width=4) (actual time=0.031..0.042 rows=11 loops=1)
Index Cond: ((id >= 10) AND (id <= 20))
Buffers: shared hit=4
Total runtime: 0.094 ms
(4 rows)
-- let's have a look at the cache again (the query remains the same as above)
relname | buffers
----------------------------------+---------
index_cache_test | 2747
idx_cache_test | 4
...
-- and compare it to a bigger index scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM index_cache_test
WHERE id <= 20000 ORDER BY id
;
Index Scan using idx_cache_test on index_cache_test (cost=0.00..666.43 rows=19490 width=4) (actual time=0.072..19.921 rows=20000 loops=1)
Index Cond: (id <= 20000)
Buffers: shared hit=4 read=162
Total runtime: 24.967 ms
(4 rows)
-- this already shows that something was in the cache and further pages were read from disk
-- but to be sure, a final glance at cache contents:
relname | buffers
----------------------------------+---------
index_cache_test | 2691
idx_cache_test | 58
-- note that some of the table pages have disappeared
-- but, more importantly, a bigger part of our index is now cached
All in all, this shows that indexes and tables can be cached page by page, therefore the answer for (2) is NO.
And a final one to illustrate temporary tables being non-cached here:
CREATE TEMPORARY TABLE tmp_cache_test AS
SELECT * FROM index_cache_test ORDER BY id FETCH FIRST 20000 ROWS ONLY;
EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM tmp_cache_test ORDER BY id;
-- checking the buffer cache now shows no sign of the temp table
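As a quick sanity check, you can look for the temporary table directly in the cache view (a sketch, reusing the tmp_cache_test table from the session above):

```sql
-- count shared buffers belonging to the temp table: expect zero,
-- because temporary table pages live in backend-local buffers
-- (sized by temp_buffers), not in shared_buffers
SELECT count(*) AS buffers
FROM pg_buffercache b
INNER JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE c.relname = 'tmp_cache_test';
```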
Are there any reasons for periodic/cyclic slow-down of insert performance?
Yes: checkpoint events. With a write-intensive workload and a big-RAM server, as you describe, a large number of 'dirty' pages accumulate in memory. At the predetermined checkpoint interval, all these dirty pages get written to disk, causing a spike of IO requests. This in turn slows down the log commit writes, which manifests as the increase in INSERT response time you observe periodically. QED. This is, of course, just a guess, in the absence of a proper investigation. For a more certain answer, I recommend you read How to analyse SQL Server performance and apply the techniques described there to identify the problem.
If the problem is indeed caused by checkpoints, then SQL Server 2012 comes with Indirect Checkpoints:
Indirect checkpoints, new in SQL Server 2012, provide a configurable database-level alternative to automatic checkpoints. ... Indirect checkpoints reduce checkpoint-related I/O spiking by continually writing dirty pages to disk in the background.
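On 2012, enabling an indirect checkpoint is a one-line change (the database name below is a placeholder):

```sql
-- ask the database to keep recovery time under ~60 seconds;
-- SQL Server then trickles dirty pages to disk continuously
-- instead of flushing them in periodic bursts
ALTER DATABASE MyDatabase SET TARGET_RECOVERY_TIME = 60 SECONDS;
```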
For a more detailed discussion about checkpoint impact on performance, read
SQL Q&A: Fine Tuning for Optimal Performance:
In Search of Spikes
Q. I’m troubleshooting an issue where we see periodic I/O spikes from one of our SQL Servers. I’ve narrowed it down to checkpoints using PerfMon, but I can’t tell which database is the major culprit. How can I drill in further?
Pre-SQL Server 2012, you have the option to reduce the recovery interval value. This will increase the frequency of checkpoints, but will reduce the number of dirty pages each checkpoint has to write. Spreading out the data IO helps (buy more spindles). Separating the log IO onto its own path (its own spindle) does not help the checkpoint itself, but it isolates the log commits from its effects and thus keeps the INSERTs responsive. SSDs work miracles.
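Pre-2012, the knob is server-wide, via sp_configure (a sketch; 'recovery interval' is an advanced option):

```sql
-- expose advanced options, then shorten the recovery interval to 1 minute,
-- trading more frequent checkpoints for smaller IO spikes per checkpoint
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'recovery interval', 1;
RECONFIGURE;
```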
I would advise against any structural changes. In my opinion you already have the best clustered index for a time series. Any structural change would have to be backed by root-cause performance analysis pointing to the current structure as the problem.
Best Answer
Like @dezso commented, creating a new table and dropping the old one used to be faster in old versions, but no longer is with the new implementation in pg 9.1.
The most common problem with CLUSTER is that it requires an exclusive lock on the table, which does not go well with concurrent access to it. The solution to this problem is pg_repack, which does not lock the table exclusively.
Generally, make sure that your server configuration is fit for the task. A high setting (a lot of RAM) for maintenance_work_mem helps both CLUSTER and CREATE INDEX on big tables. The standard setting is way too small for you. Follow the links for details.
You might set it very high temporarily for a transaction with SET LOCAL and leave it at a reasonable setting otherwise. If possible, set it high enough to fit the whole operation in RAM.
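For example, a sketch with placeholder object names and an assumed 1GB value:

```sql
BEGIN;
-- the setting applies only until COMMIT; pick a value that fits in your RAM
SET LOCAL maintenance_work_mem = '1GB';
CLUSTER big_table USING big_table_pkey;  -- placeholder table and index names
COMMIT;
```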