PostGIS Indexing – How to Speed Up Query

indexindex-tuningpattern matchingpostgispostgresql

I have this query (called from PHP, hence the $ variables):

SELECT a.name,
      St_AsText(a.geom), 
      b.name AS name_b,
      b.id
FROM a
LEFT JOIN b ON b.id = a.b_id
WHERE UNACCENT(LOWER(a.name)) LIKE UNACCENT('$searched%')
ORDER BY a.geom <-> 'SRID=4326;POINT($lat $lon)'::geometry         
LIMIT 2;

There is lower(a.name) BTree index on a.name, GIST index on a.geom and b.id is primary key

It runs about 40ms for $searched = "Maria%". Now, if I set $searched = "Mariasomething some%", time goes up to 2s.

Why? Can it be speed up or perform in a more stable manner?

Best Answer

Explain

We would have to see table definition, cardinalities and the EXPLAIN output to be certain, but the reason is most likely this:

Only your spatial GiST index on a.geom can be used. The btree index is not applicable. Postgres walks through the "closest" rows until it finds the first two matching your predicate.

Normally, more restrictive conditions make queries faster, since fewer rows have to be fetched (and sorted). But not in this constellation: Fewer qualifying rows don't help a nearest neighbour search with a small LIMIT - au contrair. There are two options:

  1. The "nearest neighbour" search based on the GiST index - which is fast as long as some of the closest rows qualify - fastest if you remove the WHERE clause altogether, try it! It gets increasingly expensive if many rows have to be filtered. Basically: the longer your pattern the more expensive the query.

  2. A sequential scan, which has to solve the same problem for selective conditions: filter many rows until finding some candidates. The method is faster, but the whole table has to be read, and if more than a few rows are found, the final sort by distance gets expensive.

Basically, Postgres is stuck between a rock and a hard place with this wicked query.

Fix

You would need an index on unaccent(lower(name)) which is not possible, because unaccent() is only STABLE, not IMMUTABLE. But you can work around this limitation with an IMMUTABLE function wrapper - done right:

The best index still seems tricky and depends on your complete situation.

Since you match left-anchored patterns, I would try a separate btree index with the text_pattern_ops operator class, using the function f_unaccent() outlined in the linked answer above:

CREATE INDEX a_name_pattern_idx ON a (f_unaccent(lower(name)) text_pattern_ops);

More:

Then:

  • Rewrite the WHERE condition to match the indexed expression
  • Make the wildcard to the right of the expression explicit to make it work for prepared statements.
  • Don't concatenation text parameters into your query to begin with, that's an invitation for sneaky errors and SQL injection. Use a prepared statement:

$sql = 'SELECT a.name, ST_AsText(a.geom), b.name AS name_b, b.id
        FROM   a
        LEFT   JOIN b ON b.id = a.b_id
        WHERE  f_unaccent(lower(a.name)) LIKE (f_unaccent(lower($1)) || $$%$$)
        ORDER  BY a.geom <-> ST_GeomFromText($$POINT($2 $3)$$, 4326)
        LIMIT  2';

$result = pg_query_params($sql, array($searched, $lat, $lon));

I use dollar-quoting to avoid escaping single quotes.

For repeated calls use pg_prepare and pg_execute. Details:

Or you could try a GiST index - a multicolumn, functional GiST index combining a PostGis spatial GiST index and a functional trigram GiST index. I didn't test that combination, yet.

You need to install pg_trgm first. Instructions:

CREATE INDEX a_special_gist_idx ON a
USING gist (f_unaccent(lower(name)) gist_trgm_ops, geom);

I put f_unaccent(lower(name)) first because, per documentation:

A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.