PostgreSQL not using index scan when a second ORDER BY column is present

index-tuningperformancepostgresqlpostgresql-9.5query-performance

I need to query the latest entry from a table with >130mn entries using PostgreSQL 9.5. There are two columns of interest, row_id (integer, the primary key), and row_time (timestamp, not null, with btree index).

This query works very fast:

EXPLAIN SELECT * FROM tbl ORDER BY row_time DESC LIMIT 1;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..0.88 rows=1 width=423)
   ->  Index Scan Backward using tbl_row_time on tbl  (cost=0.57..40503877.59 rows=131390688 width=423)

In case there are several rows with the same row_time I would like to sort by row_id as a tie-breaker. Unfortunately, the query plan then regresses to a sequential scan:

EXPLAIN SELECT * FROM tbl ORDER BY row_time DESC, row_id DESC LIMIT 1;
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Limit  (cost=9602337.32..9602337.32 rows=1 width=423)
   ->  Sort  (cost=9602337.32..9930814.04 rows=131390688 width=423)
         Sort Key: row_time DESC, row_id DESC
         ->  Seq Scan on tbl  (cost=0.00..8945383.88 rows=131390688 width=423)

I don't understand why this is happening. It would be easy to retrieve the rows with the maximum row_time using the index (typically less than 10 rows) and then sort just these. Instead, the full table is scanned, which takes minutes instead of the sub-millisecond time of the first query.

Creating a multicolumn index on (row_time, row_id) makes PostgreSQL chose an index scan again, but increases the size of the index considerably. A subquery to get MAX(row_time), then filter by it and sort the result by row_id is also fast but seems overly complicated for my simple goal. Am I missing something here?

Best Answer

If you create an index on (row_time DESC, row_id DESC), then the second case will will act just as the first one.

LIMIT will always operate after sorting is taken care of, so when sorting without an index is necessary, it will sort the entire recordset prior to processing it through LIMIT.

I wonder what is it you want to achieve? Applying LIMIT over a primary indexed order, then including all rows matching the sorted field(s) in the limited result, and finally sorting the limited results by any other fields and limit again. This is the way to go in this case:

SELECT *
FROM tbl
WHERE row_time = (
  SELECT row_time
  FROM tbl
  ORDER BY row_time DESC
  LIMIT 1)
ORDER BY row_id DESC
LIMIT 1;

Whereas for LIMIT n where n > 1:

SELECT *
FROM tbl
WHERE row_time IN (
  SELECT row_time
  FROM tbl
  ORDER BY row_time DESC
  LIMIT n)
ORDER BY row_id DESC
LIMIT n;

Best Answer

Related Solutions

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

PostgreSQL Optimization – Explain Plan with Giant Gaps Between Operations

Related Question