I'm trying to improve the performance of a georeferencing feature in our application. For every address in our system, we need to store which of several boundary types it is located within (local government area, electoral boundaries and so on).
The lookup process I'm using seems very slow for what it is. With a test dataset of 690 rows and 150 relevant boundaries, the query takes just over 5 seconds on my iMac:
UPDATE testdata
SET ELB = (
SELECT gc.id
FROM geo_choice gc
JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id
WHERE gcl.name = 'FEDERAL_ELECTORATE'
AND gc.geo_choice_list_version_id = gcl.current_version
AND ST_Within(ST_Point(testdata.lon, testdata.lat), gc.geom)
LIMIT 1
)
This version takes just over 9 seconds:
UPDATE testdata
SET ELB = gc.id
FROM geo_choice gc
JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id
WHERE gcl.name = 'FEDERAL_ELECTORATE'
AND gc.geo_choice_list_version_id = gcl.current_version
AND ST_Within(ST_Point(testdata.lon, testdata.lat), gc.geom)
And this version, using CROSS JOIN LATERAL (inspired by this blog post), takes about 5.5 seconds:
UPDATE testdata t1
SET ELB = gc.id
FROM testdata t2
CROSS JOIN LATERAL (
SELECT gc.id
FROM geo_choice gc
JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id
WHERE gcl.name = 'FEDERAL_ELECTORATE'
AND gc.geo_choice_list_version_id = gcl.current_version
AND ST_Within(ST_Point(t2.lon, t2.lat), gc.geom)
LIMIT 1
) AS gc
There is a spatial index on the geo_choice table:
CREATE INDEX geo_choice_gix ON public.geo_choice USING gist (geom) TABLESPACE pg_default;
I can confirm the index is being used: if I use _ST_Within (the variant that bypasses the index), the query takes over 4 minutes.
Is there a faster way to do this kind of boundary lookup, or something else I'm missing?
Best Answer
I had a similar issue, and I found a faster way (30% faster than using joins and ST_Within, in my case) using Python with the shapely package. Shapely has a method called .contains that works similarly to ST_Within. I'm not sure whether .contains itself is faster than ST_Within, but using a scripting language lets you split the process into separate stages, and that's the key point in my understanding. These are the steps I followed:

1. Extract all the relevant geometries from the database.
2. Load them into a list() of dict(), making a relation like [{"id": 5, "shape": <shape object ..>}]. At this point you have all relevant geometries in memory and can access them instantly by key. The limitation here is the available memory.
3. Iterate over the points and test each one against the geometries with .contains. Be careful, because we are comparing all points to all geometries.

Step 1 should be run only once, or every time any geometry is updated.
Step 2 should be executed at the beginning of the script, only once, outside the loops.
Step 3 needs tuning depending on the exact implementation; for example, you should reduce the set of geometries to iterate as much as possible, and you could break out of the loop at the first match.
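Steps 2 and 3 above might look like the minimal sketch below. The boundary polygons, ids, and the lookup_boundary helper are all hypothetical stand-ins; in a real run, step 1 would pull the geometries out of geo_choice (e.g. as WKB) instead of hard-coding them:

```python
from shapely.geometry import Point, Polygon

# Step 2: in-memory relation of [{"id": ..., "shape": ...}] dicts.
# Hypothetical hard-coded polygons stand in for geometries fetched in step 1.
boundaries = [
    {"id": 5, "shape": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])},
    {"id": 7, "shape": Polygon([(10, 0), (20, 0), (20, 10), (10, 10)])},
]

def lookup_boundary(lon, lat):
    """Step 3: return the id of the first boundary containing the point."""
    p = Point(lon, lat)
    for b in boundaries:
        if b["shape"].contains(p):
            return b["id"]  # break at the first match
    return None

print(lookup_boundary(3, 4))    # inside the first polygon -> 5
print(lookup_boundary(15, 5))   # inside the second polygon -> 7
print(lookup_boundary(50, 50))  # outside both -> None
```

As the answer notes, the naive loop is all-points-against-all-geometries; a shapely STRtree over the boundary shapes would cut that down if the geometry count grows.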