Assessment
In your last query, the bitmap index scan looking for 'hat' produces 307 hits. Postgres then runs a bitmap heap scan to filter merchants similar enough (similarity(...) > 0.2), producing 12 rows. Your test is with 30K rows, so your real-life query will produce around 300 times as many hits (90k vs. 3.5k for the test case at hand). An additional index on merchant will help.
Advice
I suggest you create an additional trigram index for the similarity search. Be sure to read the chapter in the manual about trigram index support. We need the additional module pg_trgm
installed (like you obviously have).
For your first request:
How can I search for a query like 'WALMART BAGS' which will first
return me product BAG with merchant WALMART and then BAGS from other merchants.
I suggest this query using the similarity operator %:
-- SELECT set_limit(0.2); -- adjust similarity threshold only if needed
SELECT *
FROM products
WHERE to_tsvector('english', product) @@ to_tsquery('bag')
AND merchant % 'walmart'
ORDER BY merchant <-> 'walmart'
-- LIMIT n; -- possibly limit to top n results
Again, you can choose between GiST and GIN, but this time GiST carries a decisive advantage:
This can be implemented quite efficiently by GiST indexes, but not by
GIN indexes. It will usually beat the first formulation when only a
small number of the closest matches is wanted.
Therefore, I suggest this index:
CREATE INDEX prod_merchant_trgm_idx ON products USING gist (merchant gist_trgm_ops);
As for your second request:
Can I have both GIN and GIST index working for me?
Yes, you can. It would hardly make sense to have both types on the same (combination of) column(s), but Postgres can combine GiST and GIN indexes very well in the same query. I quote the excellent manual yet again, on Combining Multiple Indexes:
To combine multiple indexes, the system scans each needed index and
prepares a bitmap in memory giving the locations of table rows that
are reported as matching that index's conditions. The bitmaps are then
ANDed and ORed together as needed by the query. Finally, the actual
table rows are visited and returned. The table rows are visited in
physical order, because that is how the bitmap is laid out; this means
that any ordering of the original indexes is lost, and so a separate
sort step will be needed if the query has an ORDER BY clause. For this
reason, and because each additional index scan adds extra time, the
planner will sometimes choose to use a simple index scan even though
additional indexes are available that could have been used as well.
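For example, with both the GIN full-text index and the GiST trigram index in place, a query filtering on both conditions can let the planner AND the two bitmaps. A sketch using the table and column names from above (whether the planner actually combines the scans depends on its cost estimates):

```sql
-- Both predicates are indexable; the planner may combine the two
-- bitmap index scans with a BitmapAnd before visiting the heap:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('bag')  -- GIN FTS index
AND    merchant % 'walmart';                                 -- GiST trigram index
```

Look for a `BitmapAnd` node over two `Bitmap Index Scan` nodes in the plan output.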
Questionable use case
...each CONTENT entry consists of one random word and a text string that is the same for all rows.
A text string that is the same for all rows is just dead freight. Remove it and concatenate it in a view if you need to show it.
Obviously, you are aware of that:
Granted, it is not realistic ... But since I can't control the text ...
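If the constant part could be stripped, a view could re-attach it for display. A minimal sketch, assuming the table and column names from the query further down, a hypothetical "WORD" column holding only the variable word, and a placeholder for the constant string:

```sql
-- Hypothetical: "WORD" holds only the variable word after the constant
-- string has been removed from "CONTENT"; the view restores the old shape.
CREATE VIEW "File_display" AS
SELECT "ITEMID",
       "WORD" || ' constant text that is the same for all rows' AS "CONTENT"
FROM   "File";
```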
Upgrade your Postgres version
Running PostgreSQL 9.3.4
While still on Postgres 9.3, you should at least upgrade to the latest point release (currently 9.3.9). The official recommendation of the project:
We always recommend that all users run the latest available minor
release for whatever major version is in use.
Better yet, upgrade to 9.4 which has received major improvements for GIN indexes.
Major problem 1: Cost estimates
The cost of some text-search functions has been seriously underestimated up to and including version 9.4. That cost is raised by a factor of 100 in the upcoming version 9.5, as @jjanes describes in his recent answer. Here are the relevant thread where this was discussed and the commit message by Tom Lane.
As you can see in the commit message, to_tsvector() is among those functions. You can apply the change immediately (as superuser):
ALTER FUNCTION to_tsvector (regconfig, text) COST 100;
which should make it much more likely that your functional index is used.
Major problem 2: KNN
The core problem is that Postgres has to calculate a rank with ts_rank() for 260k rows (rows=261011) before it can order by it and pick the top 5. This is going to be expensive, even after you have fixed the other problems discussed here. It's a K-nearest-neighbour (KNN) problem by nature, and there are solutions for related cases. But I cannot think of a general solution for your case, since the rank calculation itself depends on user input. I would try to eliminate the bulk of low-ranking matches early, so that the full calculation only has to be done for a few good candidates.
One way I can think of is to combine your full-text search with trigram similarity search, which offers a working implementation for the KNN problem. This way you can pre-select the "best" matches with a LIKE predicate as candidates (in a subquery with LIMIT 50, for example) and then pick the 5 top-ranking rows according to your rank calculation in the main query.
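A sketch of that two-step approach, assuming a GiST trigram index on "CONTENT" and the names from the query further down; untested:

```sql
SELECT "ITEMID",
       ts_rank(to_tsvector('english', "CONTENT"),
               plainto_tsquery('english', 'searchTerm')) AS rank
FROM  (
   SELECT "ITEMID", "CONTENT"
   FROM   "File"
   ORDER  BY "CONTENT" <-> 'searchTerm'  -- KNN: can use a GiST trigram index
   LIMIT  50                             -- pre-select candidates only
   ) sub
WHERE  to_tsvector('english', "CONTENT") @@ plainto_tsquery('english', 'searchTerm')
ORDER  BY rank DESC
LIMIT  5;
```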
Or apply both predicates in the same query and pick the closest matches according to trigram similarity (which would produce different results) like in this related answer:
I did some more research and you are not the first to run into this problem. Related posts on pgsql-general:
Work is ongoing to eventually implement a tsvector <-> tsquery operator.
Oleg Bartunov and Alexander Korotkov even presented a working prototype (using >< as the operator instead of <-> back then), but it is very complex to integrate into Postgres: the whole infrastructure for GIN indexes has to be reworked (most of which is done by now).
Major problem 3: weights and index
And I identified one more factor adding to the slowness of the query. Per documentation:
GIN indexes are not lossy for standard queries, but their performance
depends logarithmically on the number of unique words. (However, GIN
indexes store only the words (lexemes) of tsvector values, and not
their weight labels. Thus a table row recheck is needed when using a
query that involves weights.)
Bold emphasis mine. As soon as weights are involved, each row has to be fetched from the heap (not just a cheap visibility check) and long values have to be de-toasted, which adds to the cost. But there seems to be a solution for that:
Index definition
Looking at your index again, it doesn't seem to make sense to begin with. You assign a weight to a single column, which is meaningless as long as you don't concatenate other columns with a different weight. COALESCE() also makes no sense as long as you don't actually concatenate more columns.
Simplify your index:
CREATE INDEX "File_contentIndex" ON "File" USING gin
   (to_tsvector('english', "CONTENT"));
And your query:
SELECT "ITEMID",
       ts_rank(to_tsvector('english', "CONTENT"),
               plainto_tsquery('english', 'searchTerm')) AS rank
FROM   "File"
WHERE  to_tsvector('english', "CONTENT") @@ plainto_tsquery('english', 'searchTerm')
ORDER  BY rank DESC
LIMIT  5;
Still expensive for a search term that matches every row, but probably much less.
Asides
With all of these issues combined, the insane cost of 520 seconds for your second query begins to make sense. But there may still be more problems. Did you configure your server?
All the usual advice for performance optimization applies.
It makes your life easier if you don't work with double-quoted CaMeL-case identifiers:
Best Answer
Have you considered scanning the texts as they are input and flagging them (possibly with a many-to-many link to another table containing your list of keywords)?
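A sketch of what that could look like, where all names are assumptions (and "ITEMID" is assumed unique):

```sql
CREATE TABLE keyword (
   keyword_id serial PRIMARY KEY,
   word       text NOT NULL UNIQUE
);

-- many-to-many link between files and flagged keywords
CREATE TABLE file_keyword (
   itemid     int NOT NULL,               -- references "File"."ITEMID"
   keyword_id int NOT NULL REFERENCES keyword,
   PRIMARY KEY (itemid, keyword_id)
);

-- Flag on input, e.g. from a trigger or application code:
INSERT INTO file_keyword (itemid, keyword_id)
SELECT f."ITEMID", k.keyword_id
FROM   "File" f
JOIN   keyword k
  ON   to_tsvector('english', f."CONTENT") @@ plainto_tsquery('english', k.word);
```

Searches for a known keyword then become a cheap join on the link table instead of a full-text scan.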