I have a multi-TB table with geolocation data for a certain item id and timestamp.
It is partitioned by time with a partition for every month (~500GB), and clustered (including the partitions) on the item id and time index.
CREATE TABLE data(
itemid integer not null,
ts timestamp not null,
pos geometry(Point,4326) not null,
otherdata varchar not null -- Stand in for other columns
) partition by range(ts);
CREATE INDEX data_cluster ON data (itemid, ts);
CREATE INDEX data_brin ON data USING brin (itemid, ts, pos) with (pages_per_range = 8);
CREATE TABLE data_201901 PARTITION OF data
FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
CREATE TABLE data_201902 PARTITION OF data
FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
This performs very well for our most common queries, where the data of one or a small number of items is queried for a range of time (several hours to days).
A secondary type of query we would like to do is all items in or near a location in a certain time range. Our intention is that the brin index can be used for this since (item_id,ts)
is highly correlated with pos
.
Something like this:
select itemid, ts, pos, otherdata from data
where ts >= '2019-02-02 00:00:00' AND ts < '2019-02-03 14:00:00'
AND st_within(pos, st_setsrid(
st_geomfromgeojson('{"coordinates":"coordinates":[[[-74.05059814453125,40.6920928987952],[-73.92013549804688,40.6920928987952],[-73.92013549804688,40.79405848578324],[-74.05059814453125,40.79405848578324],[-74.05059814453125,40.6920928987952]]],"type": "Polygon"}'),
4326
) :: geometry)
We see however that the normal btree index is used for this, and the queries take too long to be useful (multiple minutes to over an hour)
What is a way to enable these queries?
- Is there a combined time and geospatial index?
- Is there something that can be done to make the brin index being used?
- Is the best way partitioning on a smaller timeframe like a day and creating a geo index for each partition?
edit on request of eppesuig, an example execution plan of the above query:
Gather (cost=1000.70..71744000.35 rows=1933 width=44) (actual time=164868.705..1227645.633 rows=65701 loops=1)
Output: data_201902.itemid, data_201902.ts, data_201902.pos, data_201902.otherdata
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=18009528 read=12157953 dirtied=246522 written=253711
I/O Timings: read=2117627.130 write=2645.225
-> Parallel Append (cost=0.70..71742807.05 rows=805 width=44) (actual time=164838.218..1227566.491 rows=21900 loops=3)
Buffers: shared hit=18009528 read=12157953 dirtied=246522 written=253711
I/O Timings: read=2117627.130 write=2645.225
Worker 0: actual time=164823.732..1227566.470 rows=23522 loops=1
Buffers: shared hit=6011928 read=4078259 dirtied=81890 written=85133
I/O Timings: read=709048.959 write=873.459
Worker 1: actual time=164824.015..1227556.922 rows=22578 loops=1
Buffers: shared hit=5974480 read=4022282 dirtied=82190 written=84063
I/O Timings: read=702613.742 write=856.134
-> Parallel Index Scan using data_201902_cluster on public.data_201902 (cost=0.70..71742803.02 rows=805 width=44) (actual time=164838.216..1227552.572 rows=21900 loops=3)
Output: data_201902.itemid, data_201902.ts, data_201902.pos
Index Cond: ((data_201902.ts >= '2019-02-01 00:00:00'::timestamp without time zone) AND (data_201902.ts < '2019-02-03 00:00:00'::timestamp without time zone))
Filter: (('010...4540'::geometry, data_201902.pos))
Rows Removed by Filter: 40950205
Buffers: shared hit=18009528 read=12157953 dirtied=246522 written=253711
I/O Timings: read=2117627.130 write=2645.225
Worker 0: actual time=164823.730..1227551.115 rows=23522 loops=1
Buffers: shared hit=6011928 read=4078259 dirtied=81890 written=85133
I/O Timings: read=709048.959 write=873.459
Worker 1: actual time=164824.013..1227542.124 rows=22578 loops=1
Buffers: shared hit=5974480 read=4022282 dirtied=82190 written=84063
I/O Timings: read=702613.742 write=856.134
Planning Time: 5.442 ms
Execution Time: 1227669.990 ms
Best Answer
You can create a multi-column index on spatial and time data.
With GiST, you generally want the column which is used more-selectively in the query to be first in the index. I suspect that that is "ts", but I don't know for sure. You might want to try it both ways.
A multicolumn BRIN index is basically the same thing as multiple independent single-column BRIN indexes. If your data is clustered on "item_id", then the "ts" column in the BRIN index is unlikely to be useful, unless "item_id" and "ts" are highly correlated with each other.
Maybe, but that would not be my first resort. I'd try the
USING GIST (ts,pos)
index first. And it looks like you haven't even tried aUSING GIST (pos)
on your existing partitioning yet. I wouldn't go around changing the partitioning without first trying what you already have.