PostgreSQL – Trigram search gets much slower as the search string gets longer

full-text-search · pattern-matching · postgresql · postgresql-9.1 · postgresql-9.4

In a Postgres 9.1 database, I have a table table1 with ~1.5M rows and a column label (simplified names for the sake of this question).

There is a functional trigram-index on lower(unaccent(label)) (unaccent() has been made immutable to allow its use in the index).
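For reference, a setup like the one described can be created along these lines (a sketch: the question does not say exactly how unaccent() was made immutable; altering the function directly is one common way, a wrapper function is another):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- unaccent() is only STABLE by default; marking it IMMUTABLE allows it
-- in index expressions (safe as long as the dictionary never changes).
ALTER FUNCTION unaccent(text) IMMUTABLE;

-- Functional trigram index matching the expression used in the queries
CREATE INDEX table1_label_hun_gin_trgm ON table1
  USING gin (lower(unaccent(label)) gin_trgm_ops);
```

Queries must use exactly the same expression, lower(unaccent(label)), for the planner to use this index.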

The following query is quite fast:

SELECT count(*) FROM table1
WHERE (lower(unaccent(label)) like lower(unaccent('%someword%')));
 count 
-------
     1
(1 row)

Time: 394.295 ms

But the following query is slower:

SELECT count(*) FROM table1
WHERE (lower(unaccent(label)) like lower(unaccent('%someword and some more%')));
 count 
-------
     1
(1 row)

Time: 1405.749 ms

And adding more words is even slower, even though the search is stricter.

I tried a simple trick to run a subquery for the first word and then a query with the full search string, but (sadly) the query planner saw through my machinations:

EXPLAIN ANALYZE
SELECT * FROM (
   SELECT id, title, label from table1
   WHERE lower(unaccent(label)) like lower(unaccent('%someword%'))
   ) t1
WHERE lower(unaccent(label)) like lower(unaccent('%someword and some more%'));
Bitmap Heap Scan on table1  (cost=16216.01..16220.04 rows=1 width=212) (actual time=1824.017..1824.019 rows=1 loops=1)
  Recheck Cond: ((lower(unaccent((label)::text)) ~~ '%someword%'::text) AND (lower(unaccent((label)::text)) ~~ '%someword and some more%'::text))
  ->  Bitmap Index Scan on table1_label_hun_gin_trgm  (cost=0.00..16216.01 rows=1 width=0) (actual time=1823.900..1823.900 rows=1 loops=1)
        Index Cond: ((lower(unaccent((label)::text)) ~~ '%someword%'::text) AND (lower(unaccent((label)::text)) ~~ '%someword and some more%'::text))
Total runtime: 1824.064 ms

My underlying problem is that the search string comes from a web interface, which may send quite long strings; these are slow to search and may also constitute a DoS vector.

So my questions are:

  • How to speed up the query?
  • Is there a way to break it into subqueries so that it is faster?
  • Is a later version of Postgres better? (I tried 9.4 and it does not seem faster: the same effect remains.)
  • Maybe a different indexing strategy is needed?

Best Answer

In PostgreSQL 9.6 there will be a new version of pg_trgm, 1.2, which will handle this much better. With a little effort, you can also get this new version to work under PostgreSQL 9.4: you have to apply the patch, then compile and install the extension module yourself.
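Once you are on a server whose pg_trgm package provides 1.2, the installed extension version can be checked and upgraded roughly like this (sketch):

```sql
-- See which version of pg_trgm is currently installed
SELECT extversion FROM pg_extension WHERE extname = 'pg_trgm';

-- Upgrade the installed extension to the newly available version
ALTER EXTENSION pg_trgm UPDATE TO '1.2';
```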

What the old version does is search for each trigram in the query, take the union of the matches, and then apply a filter. The new version instead picks the rarest trigram in the query, searches for just that one, and filters on the rest afterwards.

The machinery to do this does not exist in 9.1. In 9.4 that machinery was added, but pg_trgm wasn't adapted to make use of it at that time.

You would still have a potential DoS issue, as a malicious user can craft a query that contains only common trigrams, like '%and%' or even '%a%'.
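As a partial mitigation (my suggestion, not from the original answer), you can cap how long any single search is allowed to run, so a pathological search string fails fast instead of tying up a backend:

```sql
-- Abort any statement in this session that runs longer than 2 seconds
SET statement_timeout = '2s';
```

This can also be set per role or per database, or with SET LOCAL inside the transaction that runs the search.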


If you can't upgrade to pg_trgm 1.2, then another way to trick the planner would be:

WHERE (lower(unaccent(label)) like lower(unaccent('%someword%'))) 
AND   (lower(unaccent(label||'')) like 
      lower(unaccent('%someword and some more%')));

By concatenating the empty string to label, you trick the planner into thinking it can't use the index for that part of the WHERE clause. So it uses the index only for %someword%, and applies the full pattern as a filter to just those rows.


Also, if you are always searching for entire words, you could use a function that tokenizes the string into an array of words, and build a regular built-in GIN index (not pg_trgm) on that array-returning function.
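A sketch of that approach (function and index names are illustrative, and it assumes unaccent() has been made immutable as in the question): define an immutable function that splits the label into normalized words, index its result with a plain GIN index on text[], and search with the array-containment operator.

```sql
-- Immutable tokenizer: lower-cased, unaccented words as a text array
CREATE OR REPLACE FUNCTION label_words(text)
  RETURNS text[] LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT regexp_split_to_array(lower(unaccent($1)), '\W+') $$;

CREATE INDEX table1_label_words_idx ON table1
  USING gin (label_words(label));

-- "Contains all of these words": the cost no longer grows with the
-- number of trigrams in the search string, only with word frequency
SELECT count(*) FROM table1
WHERE label_words(label) @> ARRAY['someword', 'and', 'some', 'more'];
```

Note that this only matches whole words, not arbitrary substrings, so it is a different contract than LIKE '%...%'.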