PostgreSQL – Huge mismatch between reported index size and number of buffers in execution plan

Tags: execution-plan, index, postgresql, postgresql-9.3

The issue

We have a query like

SELECT COUNT(1) 
  FROM article
  JOIN reservation ON a_id = r_article_id 
 WHERE r_last_modified < now() - '8 weeks'::interval 
   AND r_group_id = 1 
   AND r_status = 'OPEN';

As it runs into a timeout (after 10 minutes) more often than not, I decided to investigate the issue.

The EXPLAIN (ANALYZE, BUFFERS) output looks like this:

 Aggregate  (cost=264775.48..264775.49 rows=1 width=0) (actual time=238960.290..238960.291 rows=1 loops=1)
   Buffers: shared hit=200483 read=64361 dirtied=666 written=8, temp read=3631 written=3617
   I/O Timings: read=169806.955 write=0.154
   ->  Hash Join  (cost=52413.67..264647.65 rows=51130 width=0) (actual time=1845.483..238957.588 rows=21644 loops=1)
         Hash Cond: (reservation.r_article_id = article.a_id)
         Buffers: shared hit=200483 read=64361 dirtied=666 written=8, temp read=3631 written=3617
         I/O Timings: read=169806.955 write=0.154
         ->  Index Scan using reservation_r_article_id_idx1 on reservation  (cost=0.42..205458.72 rows=51130 width=4) (actual time=34.035..237000.197 rows=21644 loops=1)
               Filter: ((r_group_id = 1) AND (r_status = 'OPEN') AND (r_last_modified < (now() - '56 days'::interval)))
               Rows Removed by Filter: 151549
               Buffers: shared hit=200193 read=48853 dirtied=450 written=8
               I/O Timings: read=168614.105 write=0.154
         ->  Hash  (cost=29662.22..29662.22 rows=1386722 width=4) (actual time=1749.392..1749.392 rows=1386814 loops=1)
               Buckets: 32768  Batches: 8  Memory Usage: 6109kB
               Buffers: shared hit=287 read=15508 dirtied=216, temp written=3551
               I/O Timings: read=1192.850
               ->  Seq Scan on article  (cost=0.00..29662.22 rows=1386722 width=4) (actual time=23.822..1439.310 rows=1386814 loops=1)
                     Buffers: shared hit=287 read=15508 dirtied=216
                     I/O Timings: read=1192.850
 Total runtime: 238961.812 ms

The bottleneck node is obviously the index scan. So let's see the index definition:

CREATE INDEX reservation_r_article_id_idx1 
    ON reservation USING btree (r_article_id)
 WHERE (r_status <> ALL (ARRAY['FULFILLED', 'CLOSED', 'CANCELED']));

Sizes and row numbers

Its size (reported by \di+ or by checking the physical file) is 36 MB. As the reservations usually spend only a relatively short time in the statuses not listed above, there is a lot of updating going on, so the index is quite bloated (around 24 MB is wasted here) – still, the size is relatively small.
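
For reference, the on-disk size and a rough bloat estimate can also be queried directly; the second statement is only a sketch and assumes the pgstattuple extension is installed:

-- on-disk size of the partial index
SELECT pg_size_pretty(pg_relation_size('reservation_r_article_id_idx1'));

-- rough bloat estimate; requires: CREATE EXTENSION pgstattuple;
SELECT avg_leaf_density, leaf_fragmentation
  FROM pgstatindex('reservation_r_article_id_idx1');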

The reservation table is about 3.8 GB in size, containing around 40 million rows. The number of reservations that are not yet closed is about 170,000 (the exact number is reported in the index scan node above).
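
For completeness, these numbers can be obtained with queries along these lines (a sketch, not necessarily the exact statements used):

-- heap size and total size including indexes and TOAST
SELECT pg_size_pretty(pg_relation_size('reservation'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('reservation')) AS total_size;

-- reservations still covered by the partial index
SELECT count(*)
  FROM reservation
 WHERE r_status NOT IN ('FULFILLED', 'CLOSED', 'CANCELED');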

Now the surprise: the index scan reports fetching a huge amount of buffers (that is, 8 kB pages):

Buffers: shared hit=200193 read=48853 dirtied=450 written=8

The buffers hit in cache and read from disk (or the OS cache) add up to 1.9 GB!
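
That figure is simply the hit and read buffer counts of the index scan node multiplied by the 8 kB block size:

SELECT pg_size_pretty((200193 + 48853) * 8192::bigint);  -- about 1946 MB, i.e. roughly 1.9 GB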

Worst-case scenario

On the other hand, the worst-case scenario, where every tuple sits on a different page of the table, would account for visiting (21644 + 151549) + 4608 pages (the total number of rows fetched from the table plus the number of index pages implied by the physical size). That is still under 180,000 – far below the nearly 250,000 buffers observed.
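
Spelled out as arithmetic, with the numbers taken from the plan and from the index size above:

-- rows returned + rows removed by the filter (worst-case heap pages)
-- plus the index pages implied by the 36 MB physical size
SELECT (21644 + 151549) + (36 * 1024 * 1024 / 8192) AS worst_case_pages;  -- 177801, vs. the ~249000 buffers observed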

It is interesting (and maybe important) that the disk read speed is around 2.2 MB/s, which is quite normal, I guess.
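
That figure follows from the read counter and the read I/O time of the index scan node:

-- 48853 pages read, 8 kB each, over ~168614 ms of read I/O time
SELECT round((48853 * 8192 / 1024.0 / 1024.0) / (168614 / 1000.0), 2) AS mb_per_s;  -- ≈ 2.26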

So what?

Does anyone have an idea about where this discrepancy can come from?

Note: To be clear, we have ideas about what to improve/change here, but I really would like to understand the numbers I got – that is what this question is about.

Update: checking the effect of caching or microvacuuming

Based on jjanes's answer, I checked what happens when I re-run the exact same query straight away. The number of affected buffers does not really change. (To do this, I simplified the query to the bare minimum that still shows the issue; a sketch of it follows the two plans below.) This is what I see from the first run:

 Aggregate  (cost=240541.52..240541.53 rows=1 width=0) (actual time=97703.589..97703.590 rows=1 loops=1)
   Buffers: shared hit=413981 read=46977 dirtied=56
   I/O Timings: read=96807.444
   ->  Index Scan using reservation_r_article_id_idx1 on reservation  (cost=0.42..240380.54 rows=64392 width=0) (actual time=13.757..97698.461 rows=19236 loops=1)
         Filter: ((r_group_id = 1) AND (r_status = 'OPEN') AND (r_last_modified < (now() - '56 days'::interval)))
         Rows Removed by Filter: 232481
         Buffers: shared hit=413981 read=46977 dirtied=56
         I/O Timings: read=96807.444
 Total runtime: 97703.694 ms

and after the second one:

 Aggregate  (cost=240543.26..240543.27 rows=1 width=0) (actual time=388.123..388.124 rows=1 loops=1)
   Buffers: shared hit=460990
   ->  Index Scan using reservation_r_article_id_idx1 on reservation  (cost=0.42..240382.28 rows=64392 width=0) (actual time=0.032..385.900 rows=19236 loops=1)
         Filter: ((r_group_id = 1) AND (r_status = 'OPEN') AND (r_last_modified < (now() - '56 days'::interval)))
         Rows Removed by Filter: 232584
         Buffers: shared hit=460990
 Total runtime: 388.187 ms
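
For reference, the simplified query behind these two plans was presumably just the reservation side of the original, something like:

SELECT COUNT(1)
  FROM reservation
 WHERE r_last_modified < now() - '8 weeks'::interval
   AND r_group_id = 1
   AND r_status = 'OPEN';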

Best Answer

I think the key here is the large number of updates, and the resulting bloat on the index.

The index contains pointers to rows in the table that are no longer 'live'. These are the old versions of updated rows. The old row versions are kept around for a while to satisfy queries with an old snapshot, and then kept around for a while longer because no one wants to do the work of removing them more often than necessary.
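
One way to get a feeling for how many such dead row versions are currently lying around is the standard statistics view (a sketch):

SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 WHERE relname = 'reservation';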

When scanning the index, the scan has to visit these rows, notice that they are no longer visible, and ignore them. The EXPLAIN (ANALYZE, BUFFERS) output does not report this activity explicitly, except through the buffers read/hit in the process of inspecting these rows.

There is some "microvacuum" code for btrees, so that when the scan gets back to the index, it remembers that the pointer it chased down was no longer live and marks it as dead in the index. That way, the next similar query does not need to chase it down again. So if you run the exact same query again, you will probably see the buffer accesses drop closer to what you predicted.

You can also VACUUM the table more often; that will clean the dead tuples out of the table itself, not just out of the partial index. In general, tables with a high-turnover partial index are likely to benefit from more aggressive vacuuming than the default settings provide.
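
As a sketch of what that could look like (the scale factor value is only an illustrative assumption, not a recommendation tuned for this system):

-- one-off cleanup of dead tuples (and fresh statistics)
VACUUM (VERBOSE, ANALYZE) reservation;

-- make autovacuum trigger after ~0.5% of the table has changed
-- (the default autovacuum_vacuum_scale_factor is 0.2, i.e. 20%)
ALTER TABLE reservation SET (autovacuum_vacuum_scale_factor = 0.005);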