Postgresql – postgres includes listed column in sort key and thus fails to use index

This is a query that occurs reasonably frequently:

SELECT DISTINCT id, published_year, first_author 
FROM publication 
ORDER BY published_year DESC NULLS LAST 
LIMIT 10;

There is an index that could be used for this:

CREATE INDEX publication_published_year_idcombineddesc2_btree 
   ON public.publication USING btree (published_year DESC NULLS LAST, id)

However, it is not used, and thus the query lasts 30-40 seconds:

EXPLAIN ANALYZE SELECT DISTINCT id, published_year, first_author FROM publication ORDER BY published_year DESC NULLS LAST LIMIT 10;
                                                              QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3527756.39..3527756.49 rows=10 width=23) (actual time=33860.282..33860.292 rows=10 loops=1)
   ->  Unique  (cost=3527756.39..3623290.85 rows=9553446 width=23) (actual time=33860.280..33860.288 rows=10 loops=1)
         ->  Sort  (cost=3527756.39..3551640.01 rows=9553446 width=23) (actual time=33860.278..33860.281 rows=10 loops=1)
               Sort Key: published_year DESC NULLS LAST, id, first_author
               Sort Method: external merge  Disk: 310440kB
               ->  Seq Scan on publication  (cost=0.00..2224226.46 rows=9553446 width=23) (actual time=0.036..20842.499 rows=9553526 loops=1)
 Planning time: 1.157 ms
 Execution time: 33945.457 ms

Now, if I remove the first_author or DISTINCT, the query runs nice and fast:

EXPLAIN ANALYZE SELECT DISTINCT id, published_year FROM publication ORDER BY published_year DESC NULLS LAST LIMIT 80;
                                                                                      QUERY PLAN                                                                                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..4.57 rows=80 width=10) (actual time=0.055..0.250 rows=80 loops=1)
   ->  Unique  (cost=0.43..493764.18 rows=9553446 width=10) (actual time=0.053..0.236 rows=80 loops=1)
         ->  Index Only Scan using publication_published_year_idcombineddesc2_btree on publication  (cost=0.43..445996.95 rows=9553446 width=10) (actual time=0.051..0.204 rows=80 loops=1)
               Heap Fetches: 1
 Planning time: 1.487 ms
 Execution time: 0.300 ms

EXPLAIN ANALYZE SELECT id, published_year, first_author FROM publication ORDER BY published_year DESC NULLS LAST LIMIT 10;
                                                                                 QUERY PLAN                                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..9.63 rows=10 width=23) (actual time=0.062..0.190 rows=10 loops=1)
   ->  Index Scan using publication_published_year_idcombineddesc2_btree on publication  (cost=0.43..8787896.00 rows=9553446 width=23) (actual time=0.061..0.186 rows=10 loops=1)
 Planning time: 1.380 ms
 Execution time: 0.223 ms

So my question is: is there a way to coerce the query planner to use an index in the first case?

UPDATE:

I tried to change random_page_cost, cursor_tuple_fraction (to various values, even to 0) and enable_seqscan to off; to no avail.

UPDATE 2:

I think I had a revelation about the "why first_author is included in the sort key": the DISTINCT keyword affects the whole retrieved tuple, so Postgres thinks that there may be duplicate (id, published_year, first_author) tuple, thus needs to check all potential such tuples so it does not miss one. Because it misses the fact that id is the primary key and thus unique, and so any tuple that contains it without joins in the FROM part is also necessarily unique. Even disregarding the above fact, it could take a gamble and get some (id, published_year) pairs from the appropriately sorted index, and start retrieving the corresponding first_author from the disk and select some unique from them (they are already in good order according to the query), and if the limit is not reached then try again with some more from the index, etc.

It would be nice to be able to force that behaviour, but it does not seem likely, thus at the end we will need to change the query generator to omit DISTINCT if there is no join in the query and it contains a primary key.

Best Answer

Why just take things by the smooth handle, and build the index that works?:

CREATE INDEX ON public.publication (published_year DESC NULLS LAST, id, first_author);

You can then get rid of the other one, as it serves no purpose that can't be served by this one (the other one may be slightly smaller, but that is generally not significant)

The version of PostgreSQL currently being developed, v13, introduces "incremental sort", which does more or less what you want with your existing index. But it still won't be as fast as just building the right index in the first place.

UPDATE:

UPDATE 2:

Best Answer

Related Solutions

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Postgresql – postgres explain plan with giant gaps between operations

Related Question