PostGIS Indexing – How to Speed Up Query

indexindex-tuningpattern matchingpostgispostgresql

I have this query (called from PHP, hence the $ variables):

SELECT a.name,
      St_AsText(a.geom), 
      b.name AS name_b,
      b.id
FROM a
LEFT JOIN b ON b.id = a.b_id
WHERE UNACCENT(LOWER(a.name)) LIKE UNACCENT('$searched%')
ORDER BY a.geom <-> 'SRID=4326;POINT($lat $lon)'::geometry         
LIMIT 2;

There is lower(a.name) BTree index on a.name, GIST index on a.geom and b.id is primary key

It runs about 40ms for $searched = "Maria%". Now, if I set $searched = "Mariasomething some%", time goes up to 2s.

Why? Can it be speed up or perform in a more stable manner?

Best Answer

Explain

We would have to see table definition, cardinalities and the EXPLAIN output to be certain, but the reason is most likely this:

Only your spatial GiST index on a.geom can be used. The btree index is not applicable. Postgres walks through the "closest" rows until it finds the first two matching your predicate.

Normally, more restrictive conditions make queries faster, since fewer rows have to be fetched (and sorted). But not in this constellation: Fewer qualifying rows don't help a nearest neighbour search with a small LIMIT - au contrair. There are two options:

The "nearest neighbour" search based on the GiST index - which is fast as long as some of the closest rows qualify - fastest if you remove the WHERE clause altogether, try it! It gets increasingly expensive if many rows have to be filtered. Basically: the longer your pattern the more expensive the query.
A sequential scan, which has to solve the same problem for selective conditions: filter many rows until finding some candidates. The method is faster, but the whole table has to be read, and if more than a few rows are found, the final sort by distance gets expensive.

Basically, Postgres is stuck between a rock and a hard place with this wicked query.

Fix

You would need an index on unaccent(lower(name)) which is not possible, because unaccent() is only STABLE, not IMMUTABLE. But you can work around this limitation with an IMMUTABLE function wrapper - done right:

Does PostgreSQL support “accent insensitive” collations?

The best index still seems tricky and depends on your complete situation.

Since you match left-anchored patterns, I would try a separate btree index with the text_pattern_ops operator class, using the function f_unaccent() outlined in the linked answer above:

CREATE INDEX a_name_pattern_idx ON a (f_unaccent(lower(name)) text_pattern_ops);

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Then:

Rewrite the WHERE condition to match the indexed expression
Make the wildcard to the right of the expression explicit to make it work for prepared statements.
Don't concatenation text parameters into your query to begin with, that's an invitation for sneaky errors and SQL injection. Use a prepared statement:

$sql = 'SELECT a.name, ST_AsText(a.geom), b.name AS name_b, b.id
        FROM   a
        LEFT   JOIN b ON b.id = a.b_id
        WHERE  f_unaccent(lower(a.name)) LIKE (f_unaccent(lower($1)) || $$%$$)
        ORDER  BY a.geom <-> ST_GeomFromText($$POINT($2 $3)$$, 4326)
        LIMIT  2';

$result = pg_query_params($sql, array($searched, $lat, $lon));

I use dollar-quoting to avoid escaping single quotes.

For repeated calls use pg_prepare and pg_execute. Details:

Problems with my attempt to implement an UPSERT

Or you could try a GiST index - a multicolumn, functional GiST index combining a PostGis spatial GiST index and a functional trigram GiST index. I didn't test that combination, yet.

You need to install pg_trgm first. Instructions:

How is LIKE implemented?

CREATE INDEX a_special_gist_idx ON a
USING gist (f_unaccent(lower(name)) gist_trgm_ops, geom);

I put f_unaccent(lower(name)) first because, per documentation:

A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.

Related Solutions

PostGIS Schema Qualifier – Best Practices

A schema is like a namespace in programming. It lets you have different things with the same name.

For example, you might have a function foo() in your public schema. If PostGIS has a function foo() as well, that wouldn't work without name qualification.

Postgresql – PostGIS radius query

Best use ST_DWithin() in a LATERAL join:

SELECT t.*, neighbor.*  -- or only selected columns
FROM   tbl t
LEFT   JOIN LATERAL (
   SELECT *
   FROM   tbl t1
   WHERE  ST_DWithin(t.geo_point, t1.geo_point, 5000 * 1.609344)
   ) neighbor ON true
WHERE t.address = 'my_address';  -- to restrict to a particular address

1.609344 being the factor for converting miles to kilometers. Source: Wikipedia. Assuming meter as unit! The documentation:

For Geometries: The distance is specified in units defined by the spatial reference system of the geometries. For this function to make sense, the source geometries must both be of the same coordinate projection, having the same SRID.

For geography units are in meters

I think your SRID 4326 uses degrees, not meters. Instructions in this related answer:

ST_DWithin takes parameter as degree , not meters , why?

You need an index on geo_point to make this fast (used by ST_DWithin() automatically):

CREATE INDEX tbl_geo_point_gist ON tbl USING gist (geo_point);

A GiST index can also be used to identify nearest neighbors (in combination with a maximum number - LIMIT).

Order by distance

Best Answer

Explain

Fix

Related Solutions

PostGIS Schema Qualifier – Best Practices

Postgresql – PostGIS radius query

Related Question