PostGIS with PostgreSQL – Joining Tables Based on Time and Distance

performancepostgispostgresqlquery-performancespatialtimestamp

I have a table with several thousand rows that contains time and position data. Rows with the same group_id have the same timestamp. Lets call this table "Small":

Small Table:
id  event_time                   group_id   item_id  position
1   '2018-06-21 18:35:01.631094+00' '123a'  1   '01010230...'
2   '2018-06-21 18:35:01.630881+00' '123a'  2   '01010044...'
3   '2018-06-18 10:35:01.630663+00' '321b'  1   '01015600...'
4   '2018-06-18 10:35:01.630305+00' '321b'  2   '01010031...'

I have another table (Table "Big") that has similar data columns (time, position, data1, data2 etc). The timestamps in this table continuous and are overlapping over the first table and there are 80+ million rows:

Big Table:
id  event_time                       Data1   position
1   '2018-06-21 18:45:01.631094+00' 'john'  '01013000...'
2   '2018-06-21 18:41:01.630881+00' 'joe'   '01016000...'
3   '2018-06-21 18:33:01.630663+00' 'john'  '01017000...'
4   '2018-06-21 18:30:01.630305+00' 'rory'  '01018000...'

I have geospatial and time indexes on the two tables.

What I would like to do is find the nearest matches between Big and Small and return data1, data2 and the differences in space and time. In short I'd like to find out that "john" is the best match to group '123a', item 2 and he was 100m and 2 minutes from it but 'rory' was closest to item 1 (5 minutes and 1 km or whatever).

I've tried a command similar to this but it is way too slow. It doesn't seem to be using the indexes.

SELECT 
  big.id, 
  small.id,
  st_distance(big.position, small.position) as pos_delta,
  (big.event_time, small.event_time) as time_delta,
  big.data1,
  small.item_id
FROM big, small 
WHERE
  (big.event_time - small.event_time) < '2 hours'
ORDER BY login_sar_vessel.position <-> login_pos_report.position
LIMIT 1

Is there a way to possible first select data from big table for 2 hours around the timestamp associated with the group_id, find the big rows with the smallest (big.position to small.position) distance and then repeat for each group_id? That seems a bit messy.

Oh, and the DB is postgres 9.6 with postgis 2.4.

Best Answer

I think you want something like this,

SELECT
  big.id,
  small.id,
  ST_Distance(big.geom, small.geom),
  big.event_time <-> small.event_time
FROM small
CROSS JOIN LATERAL (
  SELECT *
  FROM big
  WHERE (small.event_time <-> big.event_time) < '2 hours'::interval
  ORDER BY small.geom <-> big.geom
  OFFSET 0
  LIMIT 1
) AS big;

You can put both of these on both timestmap and your geom for extra joy using btree_gist

CREATE EXTENSION btree_gist;

CREATE INDEX ON big USING gist (event_time, geom);
CREATE INDEX ON small USING gist (event_time, geom);
VACUUM FULL ANALYZE big;
VACUUM FULL ANALYZE small;

Note: I would upgrade to PostgreSQL 10 so you'd have parallel stuff, as it'll make a major difference for this work load.

You may also consider BRIN indexes on the timestamp, and clustering by timestamp. And after that's done going to GIS. Or even creating a new indexed MATERIALIZED VIEW with only the last week of data or whatever.