Postgresql – Improve the performance of ST_Within boundary lookups in PostGIS

Tags: performance, postgis, postgresql

I'm trying to improve the performance of a georeferencing feature in our application. For every address in our system, we need to store which of several boundary types it is located within (local government area, electoral boundaries and so on).

The lookup process I'm using seems very slow for what it is. With this test dataset of 690 rows and 150 relevant boundaries, the query takes a bit over 5 seconds on my iMac:

UPDATE testdata
SET ELB = (
    SELECT gc.id      
    FROM geo_choice gc        
    JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id      
    WHERE gcl.name = 'FEDERAL_ELECTORATE'
      AND gc.geo_choice_list_version_id = gcl.current_version        
      AND ST_Within(ST_Point(testdata.lon, testdata.lat), gc.geom)
    LIMIT 1
)

This version takes just over 9 seconds:

UPDATE testdata
SET ELB = gc.id
FROM geo_choice gc
JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id
WHERE gcl.name = 'FEDERAL_ELECTORATE'
      AND gc.geo_choice_list_version_id = gcl.current_version        
      AND ST_Within(ST_Point(testdata.lon, testdata.lat), gc.geom)

And this version, using CROSS JOIN LATERAL (inspired by this blog post), takes about 5.5 seconds:

UPDATE testdata t1
SET ELB = gc.id
FROM testdata t2
CROSS JOIN LATERAL (
    SELECT gc.id      
    FROM geo_choice gc        
    JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id      
    WHERE gcl.name = 'FEDERAL_ELECTORATE'
      AND gc.geo_choice_list_version_id = gcl.current_version        
      AND ST_Within(ST_Point(t2.lon, t2.lat), gc.geom)
    LIMIT 1
) AS gc

There is a spatial index on the geo_choice table:

CREATE INDEX geo_choice_gix ON public.geo_choice USING gist (geom) TABLESPACE pg_default;

I can confirm the index is being used: if I use _ST_Within instead, which bypasses the index, the query takes over 4 minutes.

Is there a faster way to do this kind of boundary lookup, or something else I'm missing?

Best Answer

I had a similar issue and found a faster approach (about 30% faster than using joins and ST_Within in my case) using Python with the Shapely package. Shapely has a method called .contains that works similarly to ST_Within. I'm not sure whether .contains itself is faster than ST_Within, but using a scripting language lets you split the process into separate stages, and that is the key point in my understanding.
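For reference, here is a minimal illustration of Shapely's .contains, which mirrors ST_Within with the arguments reversed: ST_Within(point, geom) in PostGIS corresponds to geom.contains(point) in Shapely. The square polygon is just a stand-in for a real boundary.

```python
from shapely.geometry import Point, Polygon

# A simple square boundary; in practice this would be a real electorate polygon.
boundary = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])

print(boundary.contains(Point(1, 1)))  # True: the point lies inside
print(boundary.contains(Point(3, 3)))  # False: the point lies outside
```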

These are the steps I followed:

  1. Store all geometries as one huge .json or several .json files (in my case I have about 15,000 geometries divided into about 500 .json files); you should be able to quickly find the geometry or group of geometries you want using a key. You could store GeoJSON or the hexadecimal representation, see the documentation.
  2. Iterate over all geometries in the .json files and keep them in memory as shapes, perhaps as a list() of dict() entries like [{"id": 5, "shape": <shape object ..>}]. At this point all relevant geometries are in memory and you can access them instantly by key. The limitation here is the memory available.
  3. Iterate over the points (as Point objects) and, for each point, iterate over the precalculated shapes using .contains. Be careful: this compares every point against every geometry.
  4. Store the result you want, depending on the implementation. Ideally this is a single connection and a single query (insert, update, ...).

Step 1 should be run only once, or whenever any geometry is updated.

Step 2 should be executed once at the beginning of the script, outside the loops.

Step 3 may need tuning depending on the exact implementation; for example, you should reduce the set of geometries to iterate over as much as possible, and you could break out of the loop at the first match.
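Steps 2 and 3 can be sketched roughly as follows. This is only a sketch: the file name, the record layout ({"id": ..., "geometry": <GeoJSON dict>}) and the function names are assumptions, and shapely.prepared (not mentioned above) is used to speed up repeated .contains tests against the same polygon.

```python
import json

from shapely.geometry import shape, Point
from shapely.prepared import prep

def load_shapes(path):
    """Step 2: load all geometries from a GeoJSON-style file into memory once.

    The file layout (a list of {"id": ..., "geometry": {...}} records) is an
    assumption for illustration.
    """
    with open(path) as f:
        records = json.load(f)
    # prep() precomputes internal structures on each polygon, speeding up
    # repeated .contains calls against many points.
    return [{"id": r["id"], "shape": prep(shape(r["geometry"]))}
            for r in records]

def lookup(point, shapes):
    """Step 3: scan the prepared shapes and stop at the first match."""
    for s in shapes:
        if s["shape"].contains(point):
            return s["id"]  # break at the first match, as suggested above
    return None

# Step 4 would then batch the (address, boundary id) pairs back to the
# database in a single UPDATE or INSERT.
```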