I have a table of about 700k rows in Postgres 9.3.3, which has the following structure:
Columns:
content_body - text
publish_date - timestamp without time zone
published - boolean
Indexes:
"articles_pkey" PRIMARY KEY, btree (id)
"article_text_gin" gin (article_text)
"articles_publish_date_id_index" btree (publish_date DESC NULLS LAST, id DESC)
The query that I am making has a full text search query and a limit, as follows:
When I search for a string which is in my index with a limit and order in the query it is fast:
explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index') order by id limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..1293.88 rows=10 width=1298) (actual time=2.073..9.837 rows=10 loops=1)
-> Index Scan using articles_pkey on articles (cost=0.42..462150.49 rows=3573 width=1298) (actual time=2.055..9.711 rows=10 loops=1)
Filter: (article_text @@ '''in_index'''::tsquery)
Rows Removed by Filter: 611
Total runtime: 9.952 ms
However if the string is not in the index it takes much longer:
explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index') order by id limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..1293.88 rows=10 width=1298) (actual time=5633.684..5633.684 rows=0 loops=1)
-> Index Scan using articles_pkey on articles (cost=0.42..462150.49 rows=3573 width=1298) (actual time=5633.672..5633.672 rows=0 loops=1)
Filter: (article_text @@ '''not_in_index'''::tsquery)
Rows Removed by Filter: 796146
Total runtime: 5633.745 ms
However if I remove the order clause it is fast for either case:
explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index') limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=55.69..90.22 rows=10 width=1298) (actual time=7.748..7.853 rows=10 loops=1)
-> Bitmap Heap Scan on articles (cost=55.69..12390.60 rows=3573 width=1298) (actual time=7.735..7.781 rows=10 loops=1)
Recheck Cond: (article_text @@ '''in_index'''::tsquery)
-> Bitmap Index Scan on article_text_gin (cost=0.00..54.80 rows=3573 width=0) (actual time=5.977..5.977 rows=8910 loops=1)
Index Cond: (article_text @@ '''in_index'''::tsquery)
Total runtime: 7.952 ms
explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index') limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=55.69..90.22 rows=10 width=1298) (actual time=0.083..0.083 rows=0 loops=1)
-> Bitmap Heap Scan on articles (cost=55.69..12390.60 rows=3573 width=1298) (actual time=0.065..0.065 rows=0 loops=1)
Recheck Cond: (article_text @@ '''not_in_index'''::tsquery)
-> Bitmap Index Scan on article_text_gin (cost=0.00..54.80 rows=3573 width=0) (actual time=0.047..0.047 rows=0 loops=1)
Index Cond: (article_text @@ '''not_in_index'''::tsquery)
Total runtime: 0.163 ms
Removing the limit clause has the same effect, although the in index query is noticably slower:
explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'in_index') order by id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=12601.46..12610.40 rows=3573 width=1298) (actual time=106.347..140.481 rows=8910 loops=1)
Sort Key: id
Sort Method: external merge Disk: 12288kB
-> Bitmap Heap Scan on articles (cost=55.69..12390.60 rows=3573 width=1298) (actual time=5.618..50.329 rows=8910 loops=1)
Recheck Cond: (article_text @@ '''in_index'''::tsquery)
-> Bitmap Index Scan on article_text_gin (cost=0.00..54.80 rows=3573 width=0) (actual time=4.243..4.243 rows=8910 loops=1)
Index Cond: (article_text @@ '''in_index'''::tsquery)
Total runtime: 170.987 ms
explain analyze select * from "articles" where article_text @@ plainto_tsquery('pg_catalog.simple', 'not_in_index') order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=12601.46..12610.40 rows=3573 width=1298) (actual time=0.067..0.067 rows=0 loops=1)
Sort Key: id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on articles (cost=55.69..12390.60 rows=3573 width=1298) (actual time=0.044..0.044 rows=0 loops=1)
Recheck Cond: (article_text @@ '''not_in_index'''::tsquery)
-> Bitmap Index Scan on article_text_gin (cost=0.00..54.80 rows=3573 width=0) (actual time=0.026..0.026 rows=0 loops=1)
Index Cond: (article_text @@ '''not_in_index'''::tsquery)
Total runtime: 0.148 ms
The little I can deduce is that overall, a bitmap index scan+bitmap heap scan is overall better for my queries then an index scan. How can I tell the query planner to do that though?
Best Answer
The big difference between the first two queries is that int the first one, it could go along the index used by the primary key of the table (and used by the
ORDER BY
clause), then filter out the rows that don't match theWHERE
condition. You can see that it had to visit about 621 rows (the 10 that got returned and 611 which were filtered) to get ready.Now the second one used the same logic, but not having found a single match (not to mention 10), it had to go through the whole index and throw away all rows (
Rows Removed by Filter: 796146
).The second pair, without ordering, chose a different plan, which in this case happened to be more effective for returning 0 rows :)
And the third pair, knowing it has to return lots of rows (it planned 3573 as opposed to 10), again went for a different plan, with a bitmap heap scan (not a bitmap index scan, as in the second pair). The time difference can be attributed mostly to this node:
If you raised
work_mem
to a higher value (say 100 MB), this difference would mostly go away, I guess.