PostgreSQL Performance – SELECT Very Slow When No Results and a LIMIT is Specified

index, performance, postgresql, postgresql-9.4, postgresql-performance, statistics

I am running into an issue where a SELECT query is very slow because it does not use an index when the number of final results is 0 and a LIMIT clause is specified.

If the number of results is greater than 0, then Postgres uses the index and the results return in ~ 1 ms. This seems to always be the case, as far as I can tell.

If the number of results is 0 and no LIMIT is used, Postgres uses the index and the results return in ~ 1 ms.

If the number of results is 0 and a LIMIT is specified, Postgres does a sequential scan and the results take ~ 13,000 ms.

Why doesn't PostgreSQL use the index in the last case?

Cardinalities:

~ 21 million rows total.
~ 3 million rows WHERE related_id=1
~ 3 million rows WHERE related_id=1 AND platform=p1
0 rows WHERE related_id=1 AND platform=p2
2 rows WHERE related_id=1 AND platform=p3
~ 8 million rows WHERE platform=p2

Postgres Version: 9.4.6

Table Schema:

CREATE TYPE platforms AS ENUM ('p1', 'p2', 'p3');

CREATE TABLE mytable (
    id bigint NOT NULL DEFAULT nextval('mytable_sq'::regclass),
    related_id integer NOT NULL,
    platform platforms NOT NULL DEFAULT 'p1'::platforms,
    name character varying(200) NOT NULL,
    CONSTRAINT mytable_pkey PRIMARY KEY (id),
    CONSTRAINT mytable_related_id_fkey FOREIGN KEY (related_id)
         REFERENCES related (id)
);

CREATE INDEX related_id__platform__index ON mytable (related_id, platform);
CREATE UNIQUE INDEX some_other_index ON mytable (related_id, lower(name::text));

Queries and plans:

This query returns 0 rows:

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p2'::platforms
LIMIT 20;

 Limit  (cost=0.00..14.07 rows=20 width=737) (actual time=12863.465..12863.465 rows=0 loops=1)
    ->  Seq Scan on mytable  (cost=0.00..1492790.47 rows=2122653 width=737) (actual time=12863.464..12863.464 rows=0 loops=1)
          Filter: ((related_id = 1) AND (platform = 'p2'::platforms))
          Rows Removed by Filter: 21076656
 Planning time: 3.540 ms
 Execution time: 12868.190 ms

This query also returns 0 rows:

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p2'::platforms;

 Bitmap Heap Scan on mytable  (cost=60533.63..1295799.94 rows=2122653 width=737) (actual time=0.890..0.890 rows=0 loops=1)
   Recheck Cond: ((related_id = 1) AND (platform = 'p2'::platforms))
   ->  Bitmap Index Scan on related_id__platform__index  (cost=0.00..60002.97 rows=2122653 width=0) (actual time=0.888..0.888 rows=0 loops=1)
         Index Cond: ((related_id = 1) AND (platform = 'p2'::platforms))
 Planning time: 0.827 ms
 Execution time: 1.104 ms

This query returns 20 rows (without LIMIT it would be upwards of 2 million rows):

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p1'::platforms
LIMIT 20;

 Limit  (cost=0.44..70.95 rows=20 width=737) (actual time=0.759..0.995 rows=20 loops=1)
   ->  Index Scan using related_id__platform__index on mytable  (cost=0.44..1217669.26 rows=345388 width=737) (actual time=0.759..0.993 rows=20 loops=1)
         Index Cond: ((related_id = 1) AND (platform = 'p1'::platforms))
 Planning time: 5.776 ms
 Execution time: 2.476 ms

This query returns 2 rows:

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p3'::platforms LIMIT 20;

 Limit  (cost=0.44..80.37 rows=20 width=737) (actual time=0.014..0.016 rows=2 loops=1)
   ->  Index Scan using related_id__platform__index on mytable  (cost=0.44..99497.62 rows=24894 width=737) (actual time=0.014..0.015 rows=2 loops=1)
         Index Cond: ((related_id = 1) AND (platform = 'p3'::platforms))
 Planning time: 0.972 ms
 Execution time: 0.123 ms

Best Answer

Postgres does a bad job estimating the frequency of the combination of predicates in your query:

SELECT * FROM tbl
WHERE  related_id = 1 AND platform = 'p2'::platforms
LIMIT  20;

Neither of your predicates is very selective on its own - that information is available to Postgres ("most common values") - assuming your statistics are up to date (see the check below):

~ 21 million rows total.
~ 3 million rows WHERE related_id=1
~ 8 million rows WHERE platform=p2
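
Those per-column frequencies can be inspected directly; a quick sketch (substitute your actual schema and table name):

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  schemaname = 'public'   -- actual schema
AND    tablename  = 'tbl'      -- actual table name
AND    attname IN ('related_id', 'platform');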

IOW, ~ every 7th row passes the 1st filter and ~ every 3rd row passes the 2nd. Postgres does the (naive) math and expects that roughly every 20th row qualifies. Since there is no ORDER BY, any 20 qualifying rows will do. The fastest way should be to scan the table sequentially and be done after ~ 400 rows - only a few data pages, very cheap.
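
The planner's independence assumption can be sketched with a quick back-of-the-envelope calculation (illustration only, using the rounded counts above; the real estimator works from the statistics in pg_stats):

SELECT (3e6 / 21e6) * (8e6 / 21e6)        AS assumed_selectivity,  -- ~ 0.054, i.e. ~ every 18th row
       20 / ((3e6 / 21e6) * (8e6 / 21e6)) AS est_rows_until_limit; -- ~ 368, i.e. roughly 400 rows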

Using an index adds some extra cost: Postgres needs to read both the index and the table. (Exception: index-only scans, which are not possible here because of your SELECT *.) That only pays off if the planner estimates that a sequential scan would have to read enough additional pages to be more expensive. That's how I would explain why you see a sequential scan for a small LIMIT, but a bitmap index scan for a big (or no) LIMIT.
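
You can make the planner show the alternative it rejected by discouraging sequential scans in a test session (a diagnostic sketch only, not something to set in production):

BEGIN;
SET LOCAL enable_seqscan = off;   -- diagnostic: makes sequential scans look prohibitively expensive
EXPLAIN ANALYZE
SELECT * FROM tbl
WHERE  related_id = 1 AND platform = 'p2'::platforms
LIMIT  20;
ROLLBACK;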

Unfortunately, the combination of your predicates turns out to be unexpectedly rare - in fact, nothing matches at all. Postgres ends up scanning the whole table without ever finding the 20 rows that would let it stop early. (The index would have been much cheaper in any case.)

0 rows WHERE related_id=1 AND platform=p2

Statistics about the combined frequency of values in multiple columns are not available to Postgres 9.4. Think about it: gathering statistics for every combination of columns would quickly get out of hand.
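
As an aside: Postgres 10 and later can collect this kind of multi-column statistics on demand with CREATE STATISTICS. That is no option on 9.4, but for reference it would look roughly like this:

CREATE STATISTICS tbl_related_platform_stats (dependencies, ndistinct)
ON related_id, platform FROM tbl;   -- Postgres 10+ only
ANALYZE tbl;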

A very simple and efficient solution for this particular case: Create a partial index:

CREATE INDEX related_id_1_platform_2_idx ON tbl (id)
WHERE  related_id = 1 AND platform = 'p2'::platforms;

Not only will this super-tiny index (currently empty) be a perfect match for your query, it will also make a count estimate available for your particular combination (an entry in pg_class.reltuples). The actual index column is mostly irrelevant for this one; pick a small column - the PK is typically the best choice.
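
You can check that stored estimate directly (a sketch, using the index name from above):

SELECT relname, reltuples
FROM   pg_class
WHERE  relname = 'related_id_1_platform_2_idx';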

If one of the two predicates can change, there is a more generic approach. Say, if related_id = 1 is the stable condition, then create:

CREATE INDEX related_id_1_idx ON tbl (platform)
WHERE  related_id = 1;
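
In principle, this one index then serves any platform value, as long as the query repeats the index predicate, e.g.:

SELECT *
FROM   tbl
WHERE  related_id = 1             -- must repeat the partial-index predicate
AND    platform = 'p3'::platforms
LIMIT  20;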

The index column is relevant again. This alone may not be enough to tip the scales, though, because Postgres only collects separate statistics for expressions in an index (for plain index columns it relies on the column statistics of the underlying table). I propose:

CREATE INDEX related_id_1_func_idx ON tbl ((platform::text::platforms))  -- double parens!
WHERE  related_id = 1;

Note the extra pair of parentheses - a syntactical requirement for expressions (as opposed to plain columns) in CREATE INDEX.
The expression platform::text::platforms doesn't actually change anything - it casts the enum to text and back. But because Postgres treats it as an expression rather than a plain column, it collects full statistics for it.

Now (after ANALYZE tbl) we have full statistics, including the most common values of platform for rows with related_id = 1.

Check with:

SELECT *
FROM   pg_stats
WHERE  schemaname = 'public'  -- actual schema
AND    tablename  = 'related_id_1_func_idx';  -- actual idx name

And Postgres should choose the index for your case - if you repeat the same expression in your query. So:

SELECT ...
WHERE related_id = 1 AND platform::text::platforms = 'p2'::platforms;
