PostgreSQL – Very slow IndexOnlyScan on partial ~16MiB indexes

postgresql

I have a table partitioned by month, and I recently added a partial, covering index to fetch certain data really fast. It works on one pair of servers (master + slave) that holds a portion of the data (including the whole of 2016), but it does not work as expected on the second pair, which holds 2017 and 2015.

What I would like to know is why it is so slow and how I could fix it.

I have to use the pg_hint extension to force an IndexOnlyScan for the slowest plan; otherwise the planner decides that a Bitmap Heap Scan will be faster (and it is not wrong). I also tried set enable_bitmapscan = off; with the same effect.
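For reference, forcing the scan looks roughly like this (a sketch assuming the pg_hint_plan extension; the alias v matches the plans below, while the selected columns and the exact WHERE clause are illustrative, since the real query was not shown):

```sql
-- The hint comment must precede the statement; pg_hint_plan parses it
-- from the query text and pins the chosen access method.
/*+ IndexOnlyScan(v partition_201701_covering_idx) */
SELECT col1, col2, col3
FROM partition_201701 v
WHERE timestmp >= '2017-01-01 00:00:00'
  AND timestmp <= '2017-01-31 00:00:00'
  AND orpr > 0;
```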

These tests are made on live servers, so clearing the cache/buffers might be tricky, but it is possible if necessary. Both master and slave behave in a similar way.

The most notable difference in config is that the faster pair has hot_standby_feedback turned off and runs 9.6.5, while the slower pair runs 9.6.2. But AFAIK that would only mean more bloat on the slower pair. I did not notice any mention of btree IndexOnlyScan fixes between 9.6.2 and 9.6.5, but I could be barking up the wrong tree here.

My coworker mentioned that maybe it is checking the visibility map to see whether tuples in the index are still live, but would that explain queries taking 30-70 minutes on a 17MiB index? About 10-15% of the tuples on those partitions are dead, and the partial index covers about 1% of the total rows.
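One way to see how much of each partition the visibility map actually covers, without installing any extension, is to compare relallvisible with relpages in pg_class (a sketch; the partition names are taken from the stats below):

```sql
-- Fraction of heap pages marked all-visible: an index-only scan must do a
-- heap fetch for every tuple that sits on a page NOT marked all-visible.
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / GREATEST(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname IN ('partition_201701', 'partition_201506', 'partition_201601');
```

A low pct_all_visible on the slow partitions would line up with the high "Heap Fetches" counts in the plans below.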

Index looks like this:

CREATE INDEX "partition_201701_covering_idx"
  ON "partition_201701"
  USING btree
  (col1, col2, col3)
  WHERE orpr > 0::numeric;

Stats:

         tablename          |                      indexname                       |  num_rows   | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched | indexprs |        indpred         | indam
----------------------------+------------------------------------------------------+-------------+------------+------------+--------+-----------------+-------------+----------------+----------+------------------------+-------
 "partition_201701"         | "partition_201701_covering_idx"                      | 5.64915e+07 | 12 GB      | 17 MB      | N      |               2 |      352450 |         352446 |          | (orpr > (0)::numeric)  | btree
 "partition_201506"         | "partition_201506_covering_idx"                      | 4.0186e+07  | 8851 MB    | 16 MB      | N      |               1 |      347222 |          11370 |          | (orpr > (0)::numeric)  | btree
 "partition_201601"         | "partition_201601_covering_idx"                      | 4.81767e+07 | 11 GB      | 15 MB      | N      |               1 |      318277 |            347 |          | (orpr > (0)::numeric)  | btree

EXPLAIN (analyze, buffers):

QUERY PLAN
Index Only Scan using "partition_201701_covering_idx" on "partition_201701" v  (cost=0.42..1309538.47 rows=359624 width=16) (actual time=14.866..4535824.075 rows=352133 loops=1)
  Index Cond: ((timestmp >= '2017-01-01 00:00:00'::timestamp without time zone) AND (timestmp <= '2017-01-31 00:00:00'::timestamp without time zone))
  Heap Fetches: 352137
  Buffers: shared hit=64485 read=239413 written=12
Planning time: 0.181 ms
Execution time: 4536208.657 ms

QUERY PLAN
Index Only Scan using "partition_201506_covering_idx" on "partition_201506" v  (cost=0.42..105239.14 rows=316903 width=16) (actual time=26.633..25690.807 rows=347219 loops=1)
  Index Cond: ((timestmp >= '2015-06-01 00:00:00'::timestamp without time zone) AND (timestmp <= '2015-06-30 00:00:00'::timestamp without time zone))
  Heap Fetches: 11373
  Buffers: shared hit=61890 read=9222 dirtied=6178
Planning time: 27.566 ms
Execution time: 25909.181 ms

QUERY PLAN
Index Only Scan using "partition_201601_covering_idx" on "partition_201601" v  (cost=0.42..14977.06 rows=403037 width=16) (actual time=29.858..203.586 rows=318277 loops=1)
  Index Cond: ((timestmp >= '2016-01-01 00:00:00'::timestamp without time zone) AND (timestmp <= '2016-01-31 00:00:00'::timestamp without time zone))
  Heap Fetches: 347
  Buffers: shared hit=101701 read=1966
Planning time: 147.848 ms
Execution time: 216.116 ms

Best Answer

If 15% of the table's rows are dead, and those dead rows are concentrated in the rows that meet the partial index condition, then that could mean up to 1500% bloat in the index itself (15% of all rows' worth of dead tuples, concentrated in the 1% of rows the index covers). For this reason, tables with partial indexes often need to be vacuumed much more aggressively than the default settings allow. But it might not even be necessary to invoke this high bloat to explain the slowness, because the lack of visibility map coverage might be sufficient on its own.
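Per-table autovacuum settings can make vacuum kick in long before the default ~20% dead-tuple threshold (a sketch; the 1% scale factor is an illustrative value, not a number taken from the question):

```sql
-- Trigger autovacuum on this partition after roughly 1% of its rows are
-- dead (plus a fixed floor of 1000), so the visibility map and the partial
-- index are kept much closer to current than the defaults allow.
ALTER TABLE partition_201701
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_vacuum_threshold    = 1000);
```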

My coworker mentioned that maybe it is checking the visibility map to see whether tuples in the index are still live, but would that explain queries taking 30-70 minutes on a 17MiB index?

Yes. 17MiB of index can reference gigabytes of table, and since the scan has to visit the table to check visibility wherever the visibility map bit is not set, it can end up thrashing through gigabytes of table data in random order, driven by that 17MiB of index.

Looking at your slowest plan: if all 239413 buffer misses had to be read from disk, and you have 7200 rpm disks (roughly 100 random reads per second), that alone accounts for about 2400 seconds, over half of your execution time. If several such queries were running at the same time, competing for disk access, that could explain all of it.

While this is technically an index-only scan, you are getting essentially zero benefit from it because of the visibility map issue.

Either the table isn't being vacuumed enough, or the effects of hot_standby_feedback are rendering the vacuuming largely ineffective. (Or possibly the table has such a high churn rate that keeping the visibility map topped up is a hopeless task, and you need to give up on index-only scans being a useful tool for you on that partition.)
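A manual vacuum sets the all-visible bits immediately, and the effect should show up directly as a drop in "Heap Fetches" the next time the query runs (a sketch; VERBOSE is optional):

```sql
-- VACUUM removes dead tuples and updates the visibility map; VERBOSE
-- reports how many pages were scanned and how many tuples were removed.
VACUUM (VERBOSE) partition_201701;

-- Then re-run the query with EXPLAIN (ANALYZE, BUFFERS) and compare the
-- "Heap Fetches" line against the 352137 seen in the slow plan above.
```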