Postgresql – Why does this limit make the postgres planner use a much slower index scan instead of a much faster bitmap heap/index scan

performancepostgresql

I have a table of about 700k rows in Postgres 9.3.3, which has the following structure:

Columns:
 content_body  - text                        
 publish_date  - timestamp without time zone 
 published     - boolean       

Indexes:
    "articles_pkey" PRIMARY KEY, btree (id)
    "article_text_gin" gin (article_text)
    "articles_publish_date_id_index" btree (publish_date DESC NULLS LAST, id DESC)

The query that I am making has a full text search query and a limit, as follows:

When I search for a string which is in my index with a limit and order in the query it is fast:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index') order by id limit 10;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..1293.88 rows=10 width=1298) (actual time=2.073..9.837 rows=10 loops=1)
   ->  Index Scan using articles_pkey on articles  (cost=0.42..462150.49 rows=3573 width=1298) (actual time=2.055..9.711 rows=10 loops=1)
         Filter: (article_text @@ '''in_index'''::tsquery)
         Rows Removed by Filter: 611
 Total runtime: 9.952 ms

However if the string is not in the index it takes much longer:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index') order by id limit 10;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..1293.88 rows=10 width=1298) (actual time=5633.684..5633.684 rows=0 loops=1)
   ->  Index Scan using articles_pkey on articles  (cost=0.42..462150.49 rows=3573 width=1298) (actual time=5633.672..5633.672 rows=0 loops=1)
         Filter: (article_text @@ '''not_in_index'''::tsquery)
         Rows Removed by Filter: 796146
 Total runtime: 5633.745 ms

However if I remove the order clause it is fast for either case:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index')  limit 10;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=55.69..90.22 rows=10 width=1298) (actual time=7.748..7.853 rows=10 loops=1)
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=7.735..7.781 rows=10 loops=1)
         Recheck Cond: (article_text @@ '''in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=5.977..5.977 rows=8910 loops=1)
               Index Cond: (article_text @@ '''in_index'''::tsquery)
 Total runtime: 7.952 ms


explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index')  limit 10;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=55.69..90.22 rows=10 width=1298) (actual time=0.083..0.083 rows=0 loops=1)
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=0.065..0.065 rows=0 loops=1)
         Recheck Cond: (article_text @@ '''not_in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=0.047..0.047 rows=0 loops=1)
               Index Cond: (article_text @@ '''not_in_index'''::tsquery)
 Total runtime: 0.163 ms

Removing the limit clause has the same effect, although the in index query is noticably slower:

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index') order by id;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12601.46..12610.40 rows=3573 width=1298) (actual time=106.347..140.481 rows=8910 loops=1)
   Sort Key: id
   Sort Method: external merge  Disk: 12288kB
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=5.618..50.329 rows=8910 loops=1)
         Recheck Cond: (article_text @@ '''in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=4.243..4.243 rows=8910 loops=1)
               Index Cond: (article_text @@ '''in_index'''::tsquery)
 Total runtime: 170.987 ms

explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index') order by id;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12601.46..12610.40 rows=3573 width=1298) (actual time=0.067..0.067 rows=0 loops=1)
   Sort Key: id
   Sort Method: quicksort  Memory: 25kB
   ->  Bitmap Heap Scan on articles  (cost=55.69..12390.60 rows=3573 width=1298) (actual time=0.044..0.044 rows=0 loops=1)
         Recheck Cond: (article_text @@ '''not_in_index'''::tsquery)
         ->  Bitmap Index Scan on article_text_gin  (cost=0.00..54.80 rows=3573 width=0) (actual time=0.026..0.026 rows=0 loops=1)
               Index Cond: (article_text @@ '''not_in_index'''::tsquery)
 Total runtime: 0.148 ms

The little I can deduce is that overall, a bitmap index scan+bitmap heap scan is overall better for my queries then an index scan. How can I tell the query planner to do that though?

Best Answer

The big difference between the first two queries is that int the first one, it could go along the index used by the primary key of the table (and used by the ORDER BY clause), then filter out the rows that don't match the WHERE condition. You can see that it had to visit about 621 rows (the 10 that got returned and 611 which were filtered) to get ready.

Now the second one used the same logic, but not having found a single match (not to mention 10), it had to go through the whole index and throw away all rows (Rows Removed by Filter: 796146).

The second pair, without ordering, chose a different plan, which in this case happened to be more effective for returning 0 rows :)

And the third pair, knowing it has to return lots of rows (it planned 3573 as opposed to 10), again went for a different plan, with a bitmap heap scan (not a bitmap index scan, as in the second pair). The time difference can be attributed mostly to this node:

Sort Method: external merge Disk: 12288kB

If you raised work_mem to a higher value (say 100 MB), this difference would mostly go away, I guess.