I have a table that contains data that is extracted from text documents. The data is stored in a column called "CONTENT"
for which I have created this index using GIN:
CREATE INDEX "File_contentIndex"
ON "File"
USING gin
(setweight(to_tsvector('english'::regconfig
, COALESCE("CONTENT", ''::character varying)::text), 'C'::"char"));
I use the following query to perform a full text search on the table:
SELECT "ITEMID",
ts_rank(setweight(to_tsvector('english', coalesce("CONTENT",'')), 'C') ,
plainto_tsquery('english', 'searchTerm')) AS "RANK"
FROM "File"
WHERE setweight(to_tsvector('english', coalesce("CONTENT",'')), 'C')
@@ plainto_tsquery('english', 'searchTerm')
ORDER BY "RANK" DESC
LIMIT 5;
The File table contains 250 000 rows and each "CONTENT"
entry consists of one random word and a text string that is the same for all rows.
Now, when I search for a random word (1 hit in the whole table) the query runs very fast (<100 ms). However, when I search for a word that is present in all the rows, the query runs extremely slow (10 minutes or more).
EXPLAIN ANALYZE
shows that for the 1-hit search a Bitmap Index Scan followed by a Bitmap Heap Scan is performed. For the slow search a Seq Scan is performed instead, which is what is taking so long.
Granted, it is not realistic to have the same data in all rows. But since I can't control the text documents that are uploaded by the users, nor the searches they perform, it is possible that a similar scenario arises (searches on terms with very high occurrence in the DB). How can I increase the performance of my search query for such a scenario?
Running PostgreSQL 9.3.4
Query plans from EXPLAIN ANALYZE
:
Quick search (1 hit in DB)
"Limit (cost=2802.89..2802.90 rows=5 width=26) (actual time=0.037..0.037 rows=1 loops=1)"
" -> Sort (cost=2802.89..2806.15 rows=1305 width=26) (actual time=0.037..0.037 rows=1 loops=1)"
" Sort Key: (ts_rank(setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char"), '''wfecg'''::tsquery))"
" Sort Method: quicksort Memory: 25kB"
" -> Bitmap Heap Scan on "File" (cost=38.12..2781.21 rows=1305 width=26) (actual time=0.030..0.031 rows=1 loops=1)"
" Recheck Cond: (setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char") @@ '''wfecg'''::tsquery)"
" -> Bitmap Index Scan on "File_contentIndex" (cost=0.00..37.79 rows=1305 width=0) (actual time=0.012..0.012 rows=1 loops=1)"
" Index Cond: (setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char") @@ '''wfecg'''::tsquery)"
"Total runtime: 0.069 ms"
Slow search (250k hits in DB)
"Limit (cost=14876.82..14876.84 rows=5 width=26) (actual time=519667.404..519667.405 rows=5 loops=1)"
" -> Sort (cost=14876.82..15529.37 rows=261017 width=26) (actual time=519667.402..519667.402 rows=5 loops=1)"
" Sort Key: (ts_rank(setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char"), '''cyberspace'''::tsquery))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Seq Scan on "File" (cost=0.00..10541.43 rows=261017 width=26) (actual time=2.097..519465.953 rows=261011 loops=1)"
" Filter: (setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char") @@ '''cyberspace'''::tsquery)"
" Rows Removed by Filter: 6"
"Total runtime: 519667.429 ms"
Best Answer
Questionable use case
A text string that is the same for all rows is just dead freight. Remove it and concatenate it in a view if you need to show it.
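A minimal sketch of that idea: store only the variable part in "CONTENT" and re-attach the constant boilerplate in a view for display. The view name and the placeholder string are hypothetical, not from the question:

```sql
-- Hypothetical sketch: keep only the distinct word per row in "CONTENT"
-- and append the shared boilerplate text only when displaying.
CREATE VIEW file_display AS
SELECT "ITEMID"
     , "CONTENT" || ' <the text string shared by all rows>' AS full_content
FROM   "File";
```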
Obviously, you are aware of that:
Upgrade your Postgres version
While still on Postgres 9.3, you should at least upgrade to the latest point release (currently 9.3.9). The official recommendation of the project:
Better yet, upgrade to 9.4, which has received major improvements for GIN indexes.
Major problem 1: Cost estimates
The cost of some text search functions was seriously underestimated up to and including version 9.4. That cost is raised by a factor of 100 in the upcoming version 9.5, as @jjanes describes in his recent answer:
Here are the relevant thread where this was discussed and the commit message by Tom Lane.
As you can see in the commit message, to_tsvector() is among those functions. You can apply the change immediately (as superuser), which should make it much more likely that your functional index is used.
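Something along these lines, a sketch mirroring the factor-100 cost increase mentioned above for the two-argument form used in the index:

```sql
-- Raise the planner's cost estimate for to_tsvector(),
-- matching what Postgres 9.5 does by default.
ALTER FUNCTION to_tsvector(regconfig, text) COST 100;
```

A higher function cost makes the planner more willing to use the precomputed functional index instead of re-evaluating the expression in a sequential scan.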
Major problem 2: KNN
The core problem is that Postgres has to calculate a rank with ts_rank() for 260k rows (rows=261011) before it can order by it and pick the top 5. This is going to be expensive, even after you have fixed the other problems discussed here. It is a K-nearest-neighbour (KNN) problem by nature, and there are solutions for related cases. But I cannot think of a general solution for your case, since the rank calculation itself depends on user input. I would try to eliminate the bulk of low-ranking matches early, so that the full calculation only has to be done for a few good candidates.

One way I can think of is to combine your full text search with trigram similarity search, which offers a working implementation for the KNN problem. This way you can pre-select the "best" matches with a LIKE predicate as candidates (in a subquery with LIMIT 50, for example) and then pick the 5 top-ranking rows according to your rank calculation in the main query.

Or apply both predicates in the same query and pick the closest matches according to trigram similarity (which would produce different results), like in this related answer:
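A sketch of the two-step approach, under these assumptions: the pg_trgm extension is installed, there is a trigram GIN (or GiST) index on "CONTENT" to support the LIKE predicate, and 'searchTerm' and LIMIT 50 are placeholders to tune for your workload:

```sql
-- Step 1 (subquery): cheaply pre-select up to 50 candidate rows with LIKE,
--   which can use a trigram index.
-- Step 2 (outer query): run the expensive ts_rank() only on those candidates.
SELECT "ITEMID"
     , ts_rank(to_tsvector('english', "CONTENT")
             , plainto_tsquery('english', 'searchTerm')) AS rank
FROM  (
   SELECT "ITEMID", "CONTENT"
   FROM   "File"
   WHERE  "CONTENT" LIKE ('%' || 'searchTerm' || '%')  -- trigram-indexable
   LIMIT  50                                           -- candidate pool
   ) sub
WHERE  to_tsvector('english', "CONTENT")
       @@ plainto_tsquery('english', 'searchTerm')
ORDER  BY rank DESC
LIMIT  5;
```

The trade-off: results are only drawn from the candidate pool, so a too-small LIMIT in the subquery can miss high-ranking rows that LIKE did not surface early.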
I did some more research and you are not the first to run into this problem. Related posts on pgsql-general:
Work is ongoing to eventually implement a tsvector <-> tsquery operator. Oleg Bartunov and Alexander Korotkov even presented a working prototype (using >< as the operator instead of <-> back then), but it's very complex to integrate into Postgres: the whole infrastructure for GIN indexes has to be reworked (most of which is done by now).

Major problem 3: Weights and index
And I identified one more factor adding to the slowness of the query. Per documentation:
Bold emphasis mine. As soon as weights are involved, each row has to be fetched from the heap (not just a cheap visibility check) and long values have to be de-toasted, which adds to the cost. But there seems to be a solution for that:
Index definition
Looking at your index again, it doesn't seem to make sense to begin with. You assign a weight to a single column, which is meaningless as long as you don't concatenate other columns with a different weight. COALESCE() also makes no sense as long as you don't actually concatenate more columns.

Simplify your index:
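A sketch of the simplified definition, which is just the index from the question with setweight() and COALESCE() dropped (NULL content then simply never matches, which is equivalent in effect to coalescing to an empty string):

```sql
CREATE INDEX "File_contentIndex"
ON "File"
USING gin
(to_tsvector('english', "CONTENT"));
```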
And your query:
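Correspondingly, a sketch of the query with setweight() and COALESCE() removed, so the expression matches the simplified index ('searchTerm' is a placeholder as before):

```sql
SELECT "ITEMID"
     , ts_rank(to_tsvector('english', "CONTENT")
             , plainto_tsquery('english', 'searchTerm')) AS rank
FROM   "File"
WHERE  to_tsvector('english', "CONTENT")
       @@ plainto_tsquery('english', 'searchTerm')
ORDER  BY rank DESC
LIMIT  5;
```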
Still expensive for a search term that matches every row, but probably much less.
Asides
All of these issues combined, the insane cost of 520 seconds for your second query is beginning to make sense. But there still may be more problems. Did you configure your server?
All the usual advice for performance optimization applies.
It makes your life easier if you don't work with double-quoted CaMeL-case identifiers: