Postgresql – Geospatial join using ST_Contains or ST_Within seems slow, with costly query and bad estimate

postgispostgresqlpostgresql-12postgresql-performance

Problem

I have two tables, one which which includes a point column and the other which includes a boundary (table definitions below). I want to select rows from the first column where the point is within the boundary in the latter column. Writing a SELECT query using what seems like a pretty straight forward join like so ends up being quite slow.

SELECT temp_locations.id
FROM temp_locations
INNER JOIN temp_boundaries
ON st_within(temp_locations.point, temp_boundaries.mpoly)
WHERE temp_boundaries.id = 11;

From a table with just over 120k rows, this takes almost 12s to return fewer than 70k rows (the exact timing isn't important here, but this is running on a new, amply powered developer machine).

The point gemoetry field is indexed, as is the mpoly geometry.

The boundaries are not trivial, they are US state boundaries, but even still this seems… very surprising to me.

I've tried both ST_Within and ST_Contains, expecting no difference but curious none the less (no difference!). The existing documentation and posts I've found about this all seem to point to doing exactly what this query does.

Am I missing something obvious here? Is this a function of using multipolygon geometry, and for filtering this should be simplified? Or is the expected and reasonable performance for this kind of query?

Tables and indexes

CREATE TABLE temp_boundaries (
    id integer,
    uuid uuid,
    mpoly geometry(MultiPolygon,4326),
    name character varying(255)
);
CREATE TABLE temp_locations (
    id integer,
    point geometry(Point,4326),
    street_address character varying(255),
    street_address_2 character varying(255),
    city character varying(255),
    state character varying(2),
    zip_code character varying(10)
);


CREATE UNIQUE INDEX temp_location_pkey ON temp_locations(id int4_ops);
CREATE UNIQUE INDEX temp_boundary_pkey ON temp_boundaries(id int4_ops);


CREATE INDEX temp_location_point_id ON temp_locations USING GIST (point gist_geometry_ops_2d);
CREATE INDEX temp_boundary_mpoly_id ON temp_boundaries USING GIST (mpoly gist_geometry_ops_2d);

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)

QUERY PLAN
Nested Loop  (cost=5.49..3530.86 rows=3080 width=4) (actual time=18.877..11298.389 rows=67498 loops=1)
  Output: temp_locations.id
  Buffers: shared hit=6157153
  ->  Index Scan using temp_boundary_pkey on public.temp_boundaries  (cost=0.29..8.30 rows=1 width=51649) (actual time=0.019..0.021 rows=1 loops=1)
        Output: temp_boundaries.mpoly
        Index Cond: (temp_boundaries.id = 11)
        Buffers: shared hit=3
  ->  Bitmap Heap Scan on public.temp_locations  (cost=5.21..3522.44 rows=12 width=36) (actual time=18.853..11277.169 rows=67498 loops=1)
        Output: temp_locations.id, temp_locations.point
        Filter: st_within(temp_locations.point, temp_boundaries.mpoly)
        Rows Removed by Filter: 4884
        Heap Blocks: exact=4072
        Buffers: shared hit=6157150
        ->  Bitmap Index Scan on temp_location_point_id  (cost=0.00..5.20 rows=123 width=0) (actual time=16.476..16.476 rows=72382 loops=1)
              Index Cond: (temp_locations.point @ temp_boundaries.mpoly)
              Buffers: shared hit=523
Planning Time: 2.262 ms
Execution Time: 11309.728 ms

Best Answer

PostgreSQL chooses the optimal plan and uses your index.

The time is probably spent in the 72382 executions of st_within that are required to verify that the rows found by the index actually satisfy the condition (the index operates with bounding boxes and consequently can return false positives, in your case 4884 of them).

The bitmap heap scan has to read 6 million 8kB blocks, which can also be an explanation for the long execution time. I don't know why, because points shouldn't be large. One possible explanation might be lots of dead tuples - does VACUUM temp_locations; help?