PostgreSQL Performance – Indexes for Combined Time and Location Data

performancepostgispostgresqlpostgresql-performance

I have a multi-TB table with geolocation data for a certain item id and timestamp.

It is partitioned by time with a partition for every month (~500GB), and clustered (including the partitions) on the item id and time index.

CREATE TABLE data(
  itemid integer not null,
  ts timestamp not null,
  pos geometry(Point,4326) not null,
  otherdata varchar not null -- Stand in for other columns
) partition by range(ts);

CREATE INDEX data_cluster ON data (itemid, ts);
CREATE INDEX data_brin ON data USING brin (itemid, ts, pos) with (pages_per_range = 8);

CREATE TABLE data_201901 PARTITION OF data
    FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');

CREATE TABLE data_201902 PARTITION OF data
    FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');

This performs very well for our most common queries, where the data of one or a small number of items is queried for a range of time (several hours to days).

A secondary type of query we would like to do is all items in or near a location in a certain time range. Our intention is that the brin index can be used for this since (item_id,ts) is highly correlated with pos.

Something like this:

select itemid, ts, pos, otherdata from data
where ts >= '2019-02-02 00:00:00' AND ts < '2019-02-03 14:00:00'
AND st_within(pos,  st_setsrid(
                st_geomfromgeojson('{"coordinates":"coordinates":[[[-74.05059814453125,40.6920928987952],[-73.92013549804688,40.6920928987952],[-73.92013549804688,40.79405848578324],[-74.05059814453125,40.79405848578324],[-74.05059814453125,40.6920928987952]]],"type": "Polygon"}'),
                4326
              ) :: geometry)

We see however that the normal btree index is used for this, and the queries take too long to be useful (multiple minutes to over an hour)

What is a way to enable these queries?

  • Is there a combined time and geospatial index?
  • Is there something that can be done to make the brin index being used?
  • Is the best way partitioning on a smaller timeframe like a day and creating a geo index for each partition?

edit on request of eppesuig, an example execution plan of the above query:

 Gather  (cost=1000.70..71744000.35 rows=1933 width=44) (actual time=164868.705..1227645.633 rows=65701 loops=1)
   Output: data_201902.itemid, data_201902.ts, data_201902.pos, data_201902.otherdata
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=18009528 read=12157953 dirtied=246522 written=253711
   I/O Timings: read=2117627.130 write=2645.225
   ->  Parallel Append  (cost=0.70..71742807.05 rows=805 width=44) (actual time=164838.218..1227566.491 rows=21900 loops=3)
         Buffers: shared hit=18009528 read=12157953 dirtied=246522 written=253711
         I/O Timings: read=2117627.130 write=2645.225
         Worker 0: actual time=164823.732..1227566.470 rows=23522 loops=1
           Buffers: shared hit=6011928 read=4078259 dirtied=81890 written=85133
           I/O Timings: read=709048.959 write=873.459
         Worker 1: actual time=164824.015..1227556.922 rows=22578 loops=1
           Buffers: shared hit=5974480 read=4022282 dirtied=82190 written=84063
           I/O Timings: read=702613.742 write=856.134
         ->  Parallel Index Scan using data_201902_cluster on public.data_201902  (cost=0.70..71742803.02 rows=805 width=44) (actual time=164838.216..1227552.572 rows=21900 loops=3)
               Output: data_201902.itemid, data_201902.ts, data_201902.pos
               Index Cond: ((data_201902.ts >= '2019-02-01 00:00:00'::timestamp without time zone) AND (data_201902.ts < '2019-02-03 00:00:00'::timestamp without time zone))
               Filter: (('010...4540'::geometry, data_201902.pos))
               Rows Removed by Filter: 40950205
               Buffers: shared hit=18009528 read=12157953 dirtied=246522 written=253711
               I/O Timings: read=2117627.130 write=2645.225
               Worker 0: actual time=164823.730..1227551.115 rows=23522 loops=1
                 Buffers: shared hit=6011928 read=4078259 dirtied=81890 written=85133
                 I/O Timings: read=709048.959 write=873.459
               Worker 1: actual time=164824.013..1227542.124 rows=22578 loops=1
                 Buffers: shared hit=5974480 read=4022282 dirtied=82190 written=84063
                 I/O Timings: read=702613.742 write=856.134
 Planning Time: 5.442 ms
 Execution Time: 1227669.990 ms

Best Answer

You can create a multi-column index on spatial and time data.

create extension btree_gist;
create index on data using gist (ts, pos);

With GiST, you generally want the column which is used more-selectively in the query to be first in the index. I suspect that that is "ts", but I don't know for sure. You might want to try it both ways.

A multicolumn BRIN index is basically the same thing as multiple independent single-column BRIN indexes. If your data is clustered on "item_id", then the "ts" column in the BRIN index is unlikely to be useful, unless "item_id" and "ts" are highly correlated with each other.

Is the best way partitioning on a smaller timeframe like a day and creating a geo index for each partition?

Maybe, but that would not be my first resort. I'd try the USING GIST (ts,pos) index first. And it looks like you haven't even tried a USING GIST (pos) on your existing partitioning yet. I wouldn't go around changing the partitioning without first trying what you already have.