I am running into an issue where a SELECT query is very slow because it does not use an index when the number of final results is 0 and a LIMIT clause is specified.
If the number of results is greater than 0, Postgres uses the index and the results return in ~1 ms. This seems to always be true, as far as I can tell.
If the number of results is 0 and no LIMIT is used, Postgres uses the index and the results return in ~1 ms.
If the number of results is 0 and a LIMIT is specified, Postgres does a sequential scan and the results take ~13,000 ms.
Why doesn't PostgreSQL use the index in the last case?
Cardinalities:
~ 21 million rows total.
~ 3 million rows WHERE related_id=1
~ 3 million rows WHERE related_id=1 AND platform=p1
0 rows WHERE related_id=1 AND platform=p2
2 rows WHERE related_id=1 AND platform=p3
~ 8 million rows WHERE platform=p2
Postgres Version: 9.4.6
Table Schema:
CREATE TYPE platforms AS ENUM ('p1', 'p2', 'p3');
CREATE TABLE mytable (
    id         bigint NOT NULL DEFAULT nextval('mytable_sq'::regclass),
    related_id integer NOT NULL,
    platform   platforms NOT NULL DEFAULT 'p1'::platforms,
    name       character varying(200) NOT NULL,
    CONSTRAINT mytable_pkey PRIMARY KEY (id),
    CONSTRAINT mytable_related_id_fkey FOREIGN KEY (related_id)
        REFERENCES related (id)
);
CREATE INDEX related_id__platform__index ON mytable (related_id, platform);
CREATE UNIQUE INDEX some_other_index ON mytable (related_id, lower(name::text));
Queries and plans:
This query returns 0 rows:
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p2'::platforms
LIMIT 20;
Limit (cost=0.00..14.07 rows=20 width=737) (actual time=12863.465..12863.465 rows=0 loops=1)
-> Seq Scan on mytable (cost=0.00..1492790.47 rows=2122653 width=737) (actual time=12863.464..12863.464 rows=0 loops=1)
Filter: ((related_id = 1) AND (platform = 'p2'::platforms))
Rows Removed by Filter: 21076656
Planning time: 3.540 ms
Execution time: 12868.190 ms
This query also returns 0 rows:
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p2'::platforms;
Bitmap Heap Scan on mytable (cost=60533.63..1295799.94 rows=2122653 width=737) (actual time=0.890..0.890 rows=0 loops=1)
Recheck Cond: ((related_id = 1) AND (platform = 'p2'::platforms))
-> Bitmap Index Scan on related_id__platform__index (cost=0.00..60002.97 rows=2122653 width=0) (actual time=0.888..0.888 rows=0 loops=1)
Index Cond: ((related_id = 1) AND (platform = 'p2'::platforms))
Planning time: 0.827 ms
Execution time: 1.104 ms
This query returns 20 rows (without LIMIT it would return upwards of 2 million rows):
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p1'::platforms
LIMIT 20;
Limit (cost=0.44..70.95 rows=20 width=737) (actual time=0.759..0.995 rows=20 loops=1)
-> Index Scan using related_id__platform__index on mytable (cost=0.44..1217669.26 rows=345388 width=737) (actual time=0.759..0.993 rows=20 loops=1)
Index Cond: ((related_id = 1) AND (platform = 'p1'::platforms))
Planning time: 5.776 ms
Execution time: 2.476 ms
This query returns 2 rows:
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p3'::platforms LIMIT 20;
Limit (cost=0.44..80.37 rows=20 width=737) (actual time=0.014..0.016 rows=2 loops=1)
-> Index Scan using related_id__platform__index on mytable (cost=0.44..99497.62 rows=24894 width=737) (actual time=0.014..0.015 rows=2 loops=1)
Index Cond: ((related_id = 1) AND (platform = 'p3'::platforms))
Planning time: 0.972 ms
Execution time: 0.123 ms
Best Answer
Postgres does a bad job estimating the frequency of the combination of predicates in your query. Each of your predicates is not very selective on its own, and that information is available to Postgres ("most common values"), assuming your statistics are up to date:
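You can inspect those per-column statistics yourself; a sketch, querying the pg_stats view for the two columns in question:

```sql
SELECT attname, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'mytable'
AND    attname IN ('related_id', 'platform');
```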
In other words, roughly every 7th row passes the first filter and roughly every 3rd row passes the second. Postgres does the (naive) math and expects that roughly every 20th row qualifies. Since there is no ORDER BY, any 20 qualifying rows will do. The fastest way should be to scan the table sequentially and be done after ~400 rows: only a few data pages, very cheap.
Using any index adds some extra cost, since Postgres needs to scan both index and table. (Exception: index-only scans, which are not possible here for your SELECT *.) That only pays off if Postgres estimates it would otherwise have to read enough additional pages to be more expensive. That's how I would explain why you see a sequential scan for a small LIMIT, but a bitmap index scan for a big (or no) LIMIT.
Unfortunately, the combination of your predicates turns out to be unexpectedly rare: Postgres ends up scanning the whole table just to find that no rows qualify at all. (The index would actually have been much cheaper in any case.)
The combined frequency of values in multiple columns is not available to Postgres. Think about it: gathering statistics like that would quickly get out of hand.
A very simple and efficient solution for this particular case: Create a partial index:
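For example (the index name is my choice; the small PK column id serves as the index column):

```sql
CREATE INDEX mytable_rel1_p2_idx ON mytable (id)
WHERE  related_id = 1
AND    platform = 'p2'::platforms;
```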
Not only will this super-tiny index be a perfect match for your query, it will also make a count estimate available for your particular combination (entry in pg_class.reltuples). The actual index column is mostly irrelevant for this one; pick a small column, typically best the PK.
If one of the two predicates can change, there is a more generic approach. Say related_id = 1 is the stable condition: then create a partial index on platform with the condition WHERE related_id = 1. The index column is relevant again. This may not be enough to tip the scales, though, because Postgres only collects complete statistics for index columns of functional indexes (otherwise it relies on the statistics of the underlying table). So I propose:
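A sketch (again, the index name is my choice):

```sql
CREATE INDEX mytable_rel1_platform_idx ON mytable ((platform::text::platforms))
WHERE  related_id = 1;
```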
Note the extra pair of parentheses - a syntactical necessity for the cast shorthand.
The expression platform::text::platforms doesn't actually change anything: it casts your enum to text and back. But it makes Postgres collect full statistics on the (presumed) new values. Now, after running ANALYZE mytable, we have full statistics, including the most common values of platform for related_id = 1.
Check with:
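Assuming the expression index was created under the name mytable_rel1_platform_idx, its statistics appear in pg_stats under the index name:

```sql
SELECT * FROM pg_stats
WHERE  tablename = 'mytable_rel1_platform_idx';
```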
And Postgres should choose the index for your case - if you repeat the same expression in your query. So:
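A sketch of the rewritten query, with the cast repeated so it matches the index expression:

```sql
SELECT * FROM mytable
WHERE  related_id = 1
AND    platform::text::platforms = 'p2'::platforms
LIMIT  20;
```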