Why LIMIT Affects Postgres Query Performance

performance, postgresql, query-performance

I have a Postgres materialized view:

       Column        |       Type        | Modifiers 
---------------------+-------------------+-----------
 document_id         | character varying | 
 recorded_date       | date              | 
 parcels             | jsonb             | 
Indexes:
"index_my_view_on_document_id" btree (document_id)
"index_my_view_on_recorded_date" btree (recorded_date)
"index_my_view_on_parcels" gin (parcels)

I'm trying to execute a paged query that filters on the parcels jsonb array field, but my performance tanks whenever I add LIMIT:

Without LIMIT:

EXPLAIN ANALYZE SELECT document_id FROM my_view WHERE (parcels @> '[3022890014]') ORDER BY recorded_date DESC;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=24178.50..24194.79 rows=6518 width=21) (actual time=11.272..11.275 rows=22 loops=1)
   Sort Key: recorded_date DESC
   Sort Method: quicksort  Memory: 26kB
   ->  Bitmap Heap Scan on my_view  (cost=78.51..23765.58 rows=6518 width=21) (actual time=3.199..10.281 rows=22 loops=1)
         Recheck Cond: (parcels @> '[3022890014]'::jsonb)
         Heap Blocks: exact=12
         ->  Bitmap Index Scan on index_my_view_on_parcels  (cost=0.00..76.88 rows=6518 width=0) (actual time=3.166..3.166 rows=22 loops=1)
               Index Cond: (parcels @> '[3022890014]'::jsonb)
 Planning time: 2.201 ms
 Execution time: 11.395 ms
(10 rows)

With LIMIT:

EXPLAIN ANALYZE SELECT document_id FROM my_view WHERE (parcels @> '[3022890014]') ORDER BY recorded_date DESC LIMIT 25;
                                                                                              QUERY PLAN                                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..2514.14 rows=25 width=21) (actual time=10471.981..17971.454 rows=22 loops=1)
   ->  Index Scan Backward using index_my_view_on_recorded_date on my_view  (cost=0.43..655374.28 rows=6518 width=21) (actual time=10471.980..17971.446 rows=22 loops=1)
         Filter: (parcels @> '[3022890014]'::jsonb)
         Rows Removed by Filter: 6517780
 Planning time: 0.164 ms
 Execution time: 17972.229 ms
(6 rows)

Adding LIMIT slows the query down by over 1000x!

I was able to work around the issue by using a nested query, as suggested here:

EXPLAIN ANALYZE SELECT * FROM (SELECT document_id, recorded_date FROM my_view WHERE (parcels @> '[3022890014]') ORDER BY recorded_date DESC) subq ORDER BY recorded_date DESC LIMIT 25;
                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=24178.50..24178.81 rows=25 width=21) (actual time=2.180..2.183 rows=22 loops=1)
   ->  Sort  (cost=24178.50..24194.79 rows=6518 width=21) (actual time=2.179..2.179 rows=22 loops=1)
         Sort Key: my_view.recorded_date DESC
         Sort Method: quicksort  Memory: 26kB
         ->  Bitmap Heap Scan on my_view  (cost=78.51..23765.58 rows=6518 width=21) (actual time=2.064..2.166 rows=22 loops=1)
               Recheck Cond: (parcels @> '[3022890014]'::jsonb)
               Heap Blocks: exact=12
               ->  Bitmap Index Scan on index_my_view_on_parcels  (cost=0.00..76.88 rows=6518 width=0) (actual time=2.030..2.030 rows=22 loops=1)
                     Index Cond: (parcels @> '[3022890014]'::jsonb)
 Planning time: 6.427 ms
 Execution time: 2.230 ms
(11 rows)
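A variant of this trick that is often suggested is to add OFFSET 0 inside the subquery; a subquery containing OFFSET can't be flattened into the outer query, so it keeps acting as an optimization fence. A minimal sketch (results should be identical to the nested query above):

-- OFFSET 0 changes nothing about the result set, but it prevents the planner
-- from pulling the subquery up into the outer query and rediscovering the
-- backward index-scan plan.
SELECT document_id
FROM (
  SELECT document_id, recorded_date
  FROM my_view
  WHERE parcels @> '[3022890014]'
  OFFSET 0
) subq
ORDER BY recorded_date DESC
LIMIT 25;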

Still, I'd like to understand why adding LIMIT results in such a huge change in performance, and if there are better ways to address this issue.

Best Answer

PostgreSQL thinks it will find 6518 rows meeting your condition. So when you tell it to stop at 25, it decides it would rather walk the rows that are already in order (scanning the recorded_date index backward) and stop once it has found the 25th match, which it expects to happen after reading 25/6518, or about 0.4%, of the table. But in reality only 22 rows meet the condition, so it ends up scanning the whole table instead, roughly 250 times more work than it expected. The other plan, using the GIN index, ends up being over 250 times less work than PostgreSQL thinks it will be, for the same reason: it thinks it will have to find and then sort 6518 rows, when it will really be 22.
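That 6518-row estimate is almost exactly 0.1% of the ~6.5 million rows the backward index scan had to filter, which looks like the planner's hard-wired default selectivity for @> rather than anything derived from column statistics. A quick way to see the misestimate in isolation (same predicate as in the question, no ORDER BY or LIMIT):

-- Planner guess for the containment predicate alone: rows=6518
EXPLAIN SELECT document_id FROM my_view WHERE parcels @> '[3022890014]';

-- Actual number of matching rows: 22
SELECT count(*) FROM my_view WHERE parcels @> '[3022890014]';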

If you used a more appropriate data type, such as a regular PostgreSQL array rather than a degenerate JSONB array, the planner would have a more accurate idea of how many rows will meet the condition and would probably make a better choice of plan.
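As an illustration of that suggestion (a sketch only, not from the original answer: the my_view_arr name, the bigint cast, and rebuilding from the existing view are all assumptions, since the question doesn't show the view definition):

-- Hypothetical rework: expose parcels as a native bigint[] so ANALYZE can
-- collect element-level statistics (most_common_elems), which the planner
-- uses to estimate @> selectivity for arrays.
CREATE MATERIALIZED VIEW my_view_arr AS
SELECT document_id,
       recorded_date,
       ARRAY(SELECT elem::bigint
             FROM jsonb_array_elements_text(parcels) AS elem) AS parcels
FROM my_view;

CREATE INDEX index_my_view_arr_on_parcels       ON my_view_arr USING gin (parcels);
CREATE INDEX index_my_view_arr_on_recorded_date ON my_view_arr (recorded_date);
ANALYZE my_view_arr;

-- Same containment test, now in array syntax. With a realistic row estimate
-- the planner should stick with the GIN bitmap plan even under LIMIT.
SELECT document_id
FROM my_view_arr
WHERE parcels @> ARRAY[3022890014::bigint]
ORDER BY recorded_date DESC
LIMIT 25;

Whether the parcel values really fit bigint is part of the assumption; the point is simply that native array columns get element statistics in a way that JSONB containment does not.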