PostgreSQL – Postgres sometimes uses an inferior index for WHERE a IN (…) ORDER BY b LIMIT N

index-tuning, paging, performance, postgresql, postgresql-9.6, query-performance

We have a PostgreSQL table with ~5 billion rows that has developed a nasty habit of missing the proper indices and doing a Primary Key scan on certain LIMIT operations.

The problem generally manifests on an ORDER BY .. LIMIT .. clause (a common pattern in Django pagination) where the LIMIT selects a relatively small subset of the rows matched by the index. An extreme example is this:

SELECT * FROM mcqueen_base_imagemeta2 
  WHERE image_id IN ( 123, ... )
  ORDER BY id DESC
  LIMIT 1;

where the IN clause contains ~20 items and the total number of rows matched by the index on image_id is 16.

The EXPLAIN shows that it skips the image_id index and instead does a backward scan of the 5B-row primary key:

Limit  (cost=0.58..4632.03 rows=1 width=28)
   ->  Index Scan Backward using mcqueen_base_imagemeta2_pkey on mcqueen_base_imagemeta2  (cost=0.58..364597074.75 rows=78722 width=28)
         Filter: (image_id = ANY ('{123, ...}'::bigint[]))

If the LIMIT is increased to 2, it works as expected:

Limit  (cost=7585.92..7585.93 rows=2 width=28)
   ->  Sort  (cost=7585.92..7782.73 rows=78722 width=28)
         Sort Key: id DESC
         ->  Index Scan using mcqueen_base_imagemeta2_image_id_616fe89c on mcqueen_base_imagemeta2  (cost=0.58..6798.70 rows=78722 width=28)
               Index Cond: (image_id = ANY ('{123, ...}'::bigint[]))

This also happens on queries where the index matches ~3,000 rows and the limit is 100, i.e. something that easily happens in real-world REST API pagination.

The table definition is:

mcqueen=# \d mcqueen_base_imagemeta2
                                       Table "public.mcqueen_base_imagemeta2"
      Column       |           Type           |                              Modifiers                               
-------------------+--------------------------+----------------------------------------------------------------------
 id                | bigint                   | not null default nextval('mcqueen_base_imagemeta2_id_seq'::regclass)
 created_at        | timestamp with time zone | not null
 image_id          | bigint                   | not null
 key_id            | smallint                 | not null
 source_version_id | smallint                 | not null
Indexes:
    "mcqueen_base_imagemeta2_pkey" PRIMARY KEY, btree (id)
    "mcqueen_base_imagemeta2_image_id_616fe89c" btree (image_id)
    "mcqueen_base_imagemeta2_key_id_a4854581" btree (key_id)
    "mcqueen_base_imagemeta2_source_version_id_f9b0513e" btree (source_version_id)
Foreign-key constraints:
    "mcqueen_base_imageme_image_id_616fe89c_fk_mcqueen_b" FOREIGN KEY (image_id) REFERENCES mcqueen_base_image(id) DEFERRABLE INITIALLY DEFERRED
    "mcqueen_base_imageme_key_id_a4854581_fk_mcqueen_b" FOREIGN KEY (key_id) REFERENCES mcqueen_base_metakey(id) DEFERRABLE INITIALLY DEFERRED
    "mcqueen_base_imageme_source_version_id_f9b0513e_fk_mcqueen_b" FOREIGN KEY (source_version_id) REFERENCES mcqueen_base_metasourceversion(id) DEFERRABLE INITIALLY DEFERRED

I'm a novice at best when it comes to tuning, but I figure that the default statistics settings are not adequate for a table of that size, so the planner naively thinks that a PK scan is faster than an index scan.
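
For reference, here is how I understand you can check which statistics target currently applies to the column (attstattarget = -1 means it falls back to default_statistics_target):

-- Global sampling target used by ANALYZE when no per-column override exists.
SHOW default_statistics_target;

-- Per-column override, if any (-1 means "use default_statistics_target").
SELECT attname, attstattarget
  FROM pg_attribute
  WHERE attrelid = 'mcqueen_base_imagemeta2'::regclass
  AND attname = 'image_id';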

Best Answer

The planner thinks it is going to find 78,722 rows, but it really finds 16, so that misestimate is going to lead to some bad plans.

When a value in the IN list is not present in the column's MCV (most common values) list, the planner guesses its frequency using the n_distinct value, which is probably way off (you didn't answer my question about that). It does this by taking the number of tuples not covered by the MCV frequency list and dividing it by the number of distinct values not in the MCV list. So basically ntuples * (1 - sum of MCV frequencies) / (n_distinct - number of MCV entries). This simplified formula ignores NULLs.
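
To see the numbers that go into that formula, you can read the column's statistics straight out of pg_stats (a sketch, assuming you have permission to read the view):

-- n_distinct drives the guess for values outside the MCV list;
-- most_common_vals / most_common_freqs are the MCV list itself.
select n_distinct, most_common_vals, most_common_freqs
  from pg_stats
  where tablename = 'mcqueen_base_imagemeta2' and attname = 'image_id';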

As @ErwinBrandstetter suggests, you might be able to improve the situation by increasing the size of the MCV list, which means increasing the statistics sample size. That might also increase the accuracy of the n_distinct estimate. But with 6 billion rows, it might not be possible to increase the sample size by enough. Also, if rows with the same image_id are clumped together in the same pages, the sampling method PostgreSQL uses to compute n_distinct is quite biased, and that bias is resistant to fixing by just increasing the sample size.
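
If you want to try that route first, the knob is the per-column statistics target (a sketch; 10000 is the upper limit, and ANALYZE gets slower as the target grows):

-- raise the per-column sample target, then re-gather statistics
alter table mcqueen_base_imagemeta2 alter column image_id set statistics 10000;
analyze mcqueen_base_imagemeta2;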

A simpler way may be to set n_distinct manually:

alter table mcqueen_base_imagemeta2 alter column image_id set (n_distinct=1000000000);
analyze mcqueen_base_imagemeta2;

This method doesn't increase the time or storage required by ANALYZE, the way increasing the sample size does, and is also more likely to succeed.
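
Once the ANALYZE has run, one way to sanity-check the result (a sketch) is to confirm that pg_stats reports the forced n_distinct and that EXPLAIN now picks the image_id index for the original query:

-- pg_stats should now report the forced n_distinct for image_id
select n_distinct from pg_stats
  where tablename = 'mcqueen_base_imagemeta2' and attname = 'image_id';

-- re-run the problem query and check which index the plan uses
explain
select * from mcqueen_base_imagemeta2
  where image_id in (123)   -- stand-in; use the same ~20 ids as in the question
  order by id desc
  limit 1;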