I'm trying to improve the performance of a georeferencing feature in our application. For every address in our system, we need to store which of several boundary types it is located within (local government area, electoral boundaries and so on).
The lookup process I'm using seems very slow for what it is. With a test dataset of 690 rows and 150 relevant boundaries, the query takes just over 5 seconds on my iMac:
UPDATE testdata
SET ELB = (
SELECT gc.id
FROM geo_choice gc
JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id
WHERE gcl.name = 'FEDERAL_ELECTORATE'
AND gc.geo_choice_list_version_id = gcl.current_version
AND ST_Within(ST_Point(testdata.lon, testdata.lat), gc.geom)
LIMIT 1
)
This version takes just over 9 seconds:
UPDATE testdata
SET ELB = gc.id
FROM geo_choice gc
JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id
WHERE gcl.name = 'FEDERAL_ELECTORATE'
AND gc.geo_choice_list_version_id = gcl.current_version
AND ST_Within(ST_Point(testdata.lon, testdata.lat), gc.geom)
And this version, using CROSS JOIN LATERAL (inspired by this blog post), takes about 5.5 seconds:
UPDATE testdata t1
SET ELB = gc.id
FROM testdata t2
CROSS JOIN LATERAL (
SELECT gc.id
FROM geo_choice gc
JOIN geo_choice_list gcl ON gc.geo_choice_list_id = gcl.id
WHERE gcl.name = 'FEDERAL_ELECTORATE'
AND gc.geo_choice_list_version_id = gcl.current_version
AND ST_Within(ST_Point(t2.lon, t2.lat), gc.geom)
LIMIT 1
) AS gc
There is a spatial index on the geo_choice table:
CREATE INDEX geo_choice_gix ON public.geo_choice USING gist (geom) TABLESPACE pg_default;
I can confirm the index is being used: if I use _ST_Within (the variant that bypasses the index), the query takes over 4 minutes.
Is there a faster way to do this kind of boundary lookup, or something else I'm missing?
Best Answer
I had a similar issue, and I found a faster way (30% faster than using joins and ST_Within, in my case) using Python with the shapely package. Shapely has a method called .contains that works similarly to ST_Within. I'm not sure whether .contains itself is faster than ST_Within, but using a scripting language lets you split the process into separate stages, and that's the key point in my understanding. These are the steps I followed:

1. Extract all the relevant geometries from the database.
2. Load them into a list() of dict(), making a relation like [{"id": 5, "shape": <shape object ..>}]. At this point you have all relevant geometries in memory and can access them instantly by key. The limitation here is the available memory.
3. Iterate over the points and test each one against the geometries with .contains. Be careful, because we are comparing all points to all geometries.

Step 1 should be run only once, or every time any geometry is updated.
Step 2 should be executed at the beginning of the script, only once, outside the loops.
Step 3 needs tuning depending on the exact implementation; for example, you should reduce the set of geometries to iterate as much as possible, and you could break out of the loop at the first match.
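Steps 2 and 3 above might look like the minimal sketch below. The boundary polygons, ids, and the lookup_boundary helper are all hypothetical stand-ins; in a real run, step 1 would pull the geometries out of geo_choice (e.g. as WKB) instead of hard-coding them:

```python
from shapely.geometry import Point, Polygon

# Step 2: in-memory relation of [{"id": ..., "shape": ...}] dicts.
# Hypothetical hard-coded polygons stand in for geometries fetched in step 1.
boundaries = [
    {"id": 5, "shape": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])},
    {"id": 7, "shape": Polygon([(10, 0), (20, 0), (20, 10), (10, 10)])},
]

def lookup_boundary(lon, lat):
    """Step 3: return the id of the first boundary containing the point."""
    p = Point(lon, lat)
    for b in boundaries:
        if b["shape"].contains(p):
            return b["id"]  # break at the first match
    return None

print(lookup_boundary(3, 4))    # inside the first polygon -> 5
print(lookup_boundary(15, 5))   # inside the second polygon -> 7
print(lookup_boundary(50, 50))  # outside both -> None
```

As the answer notes, the naive loop is all-points-against-all-geometries; a shapely STRtree over the boundary shapes would cut that down if the geometry count grows.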