I have a big table entities
with ~ 15M records. I want to find top 5 rows matching 'hockey' in their name
.
I have a full text index on name
, which is used: gin_ix_entity_full_text_search_name
Query:
SELECT "entities".*,
ts_rank(to_tsvector('english', "entities"."name"::text),
to_tsquery('english', 'hockey'::text)) AS "rank0.48661998202865475"
FROM "entities"
WHERE "entities"."place" = 'f'
AND (to_tsvector('english', "entities"."name"::text) @@ to_tsquery('english', 'hockey'::text))
ORDER BY "rank0.48661998202865475" DESC LIMIT 5
Duration 25,623 ms
Explain plan 1 Limit (cost=12666.89..12666.89 rows=5 width=3116) 2 -> Sort (cost=12666.89..12670.18 rows=6571 width=3116) 3 Sort Key: (ts_rank(to_tsvector('english'::regconfig, (name)::text), '''hockey'''::tsquery)) 4 -> Bitmap Heap Scan on entities (cost=124.06..12645.06 rows=6571 width=3116) 5 Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''hockey'''::tsquery) 6 Filter: (NOT place) 7 -> Bitmap Index Scan on gin_ix_entity_full_text_search_name (cost=0.00..123.74 rows=6625 width=0) 8 Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''hockey'''::tsquery)
I do not understand why it verifies the index condition twice. (Query plan step 4 and 7). Is it because of my boolean condition (not place
)? If so, should I add it to my index to get a very fast query? Or does the sorting condition make it slow?
EXPLAIN ANALYZE
output:
Limit (cost=4447.28..4447.29 rows=5 width=3116) (actual time=18509.274..18509.282 rows=5 loops=1) -> Sort (cost=4447.28..4448.41 rows=2248 width=3116) (actual time=18509.271..18509.273 rows=5 loops=1) Sort Key: (ts_rank(to_tsvector('english'::regconfig, (name)::text), '''test'''::tsquery)) Sort Method: top-N heapsort Memory: 19kB -> Bitmap Heap Scan on entities (cost=43.31..4439.82 rows=2248 width=3116) (actual time=119.003..18491.408 rows=2533 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''test'''::tsquery) Filter: (NOT place) -> Bitmap Index Scan on gin_ix_entity_full_text_search_name (cost=0.00..43.20 rows=2266 width=0) (actual time=74.093..74.093 rows=2593 loops=1) Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''test'''::tsquery) Total runtime: 18509.381 ms
Here are my DB parameters. It is hosted by Heroku, on Amazon services. They describe it as having 1.7GB of RAM, 1 processing unit and a DB of max 1TB.
name | current_setting ------------------------------+--------------------------------------------------------------------------------------------------------- version | PostgreSQL 9.0.7 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit archive_command | test -f /etc/postgresql/9.0/main/wal-e.d/ARCHIVING_OFF || envdir /etc/postgresql/9.0/resource29857_heroku_com/wal-e.d/env wal-e wal-push %p archive_mode | on archive_timeout | 1min checkpoint_completion_target | 0.7 checkpoint_segments | 40 client_min_messages | notice cpu_index_tuple_cost | 0.001 cpu_operator_cost | 0.0005 cpu_tuple_cost | 0.003 effective_cache_size | 1530000kB hot_standby | on lc_collate | en_US.UTF-8 lc_ctype | en_US.UTF-8 listen_addresses | * log_checkpoints | on log_destination | syslog log_line_prefix | %u [YELLOW] log_min_duration_statement | 50ms log_min_messages | notice logging_collector | on maintenance_work_mem | 64MB max_connections | 500 max_prepared_transactions | 500 max_stack_depth | 2MB max_standby_archive_delay | -1 max_standby_streaming_delay | -1 max_wal_senders | 10 port | random_page_cost | 2 server_encoding | UTF8 shared_buffers | 415MB ssl | on syslog_ident | resource29857_heroku_com TimeZone | UTC wal_buffers | 8MB wal_keep_segments | 127 wal_level | hot_standby work_mem | 100MB (39 rows)
EDIT
Looks like ORDER BY
is the slow part:
d6ifslbf0ugpu=> EXPLAIN ANALYZE SELECT "entities"."name",
ts_rank(to_tsvector('english', "entities"."name"::text),
to_tsquery('english', 'banana'::text)) AS "rank0.48661998202865475"
FROM "entities"
WHERE (to_tsvector('english', "entities"."name"::text) @@ to_tsquery('english', 'banana'::text))
LIMIT 5;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=43.31..53.07 rows=5 width=24) (actual time=76.583..103.623 rows=5 loops=1)
-> Bitmap Heap Scan on entities (cost=43.31..4467.60 rows=2266 width=24) (actual time=76.581..103.613 rows=5 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''banana'''::tsquery)
-> Bitmap Index Scan on gin_ix_entity_full_text_search_name (cost=0.00..43.20 rows=2266 width=0) (actual time=53.592..53.592 rows=1495 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''banana'''::tsquery)
Total runtime: 103.680 ms
Vs. with ORDER BY
:
d6ifslbf0ugpu=> EXPLAIN ANALYZE SELECT "entities"."name",
ts_rank(to_tsvector('english', "entities"."name"::text),
to_tsquery('english', 'banana'::text)) AS "rank0.48661998202865475"
FROM "entities"
WHERE (to_tsvector('english', "entities"."name"::text) @@ to_tsquery('english', 'banana'::text))
ORDER BY "rank0.48661998202865475" DESC
LIMIT 5;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4475.12..4475.13 rows=5 width=24) (actual time=15013.735..15013.741 rows=5 loops=1)
-> Sort (cost=4475.12..4476.26 rows=2266 width=24) (actual time=15013.732..15013.735 rows=5 loops=1)
Sort Key: (ts_rank(to_tsvector('english'::regconfig, (name)::text), '''banana'''::tsquery))
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on entities (cost=43.31..4467.60 rows=2266 width=24) (actual time=0.872..15006.763 rows=1495 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''banana'''::tsquery)
-> Bitmap Index Scan on gin_ix_entity_full_text_search_name (cost=0.00..43.20 rows=2266 width=0) (actual time=0.549..0.549 rows=1495 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''banana'''::tsquery)
Total runtime: 15013.805 ms
Bit I still don't understand why this is slower. Looks like it's fetching the same amount of rows from Bitmap Heap Scan, but it takes so much longer?
Best Answer
That sorting the rows will cost something is obvious. But why so much?
Without
ORDER BY rank0...
Postgres can justts_rank()
for just 5 rows.ts_rank()
for all of them.ORDER BY name
just to see the cost of computingto_tsquery('english', 'hockey'::text))
for the superfluous rows and how much remains for fetching more rows and sorting.