PostGIS with PostgreSQL – Joining Tables Based on Time and Distance

performancepostgispostgresqlquery-performancespatialtimestamp

I have a table with several thousand rows that contains time and position data. Rows with the same group_id have the same timestamp. Lets call this table "Small":

Small Table:
id  event_time                   group_id   item_id  position
1   '2018-06-21 18:35:01.631094+00' '123a'  1   '01010230...'
2   '2018-06-21 18:35:01.630881+00' '123a'  2   '01010044...'
3   '2018-06-18 10:35:01.630663+00' '321b'  1   '01015600...'
4   '2018-06-18 10:35:01.630305+00' '321b'  2   '01010031...'

I have another table (Table "Big") that has similar data columns (time, position, data1, data2 etc). The timestamps in this table continuous and are overlapping over the first table and there are 80+ million rows:

Big Table:
id  event_time                       Data1   position
1   '2018-06-21 18:45:01.631094+00' 'john'  '01013000...'
2   '2018-06-21 18:41:01.630881+00' 'joe'   '01016000...'
3   '2018-06-21 18:33:01.630663+00' 'john'  '01017000...'
4   '2018-06-21 18:30:01.630305+00' 'rory'  '01018000...'

I have geospatial and time indexes on the two tables.

What I would like to do is find the nearest matches between Big and Small and return data1, data2 and the differences in space and time. In short I'd like to find out that "john" is the best match to group '123a', item 2 and he was 100m and 2 minutes from it but 'rory' was closest to item 1 (5 minutes and 1 km or whatever).

I've tried a command similar to this but it is way too slow. It doesn't seem to be using the indexes.

SELECT 
  big.id, 
  small.id,
  st_distance(big.position, small.position) as pos_delta,
  (big.event_time, small.event_time) as time_delta,
  big.data1,
  small.item_id
FROM big, small 
WHERE
  (big.event_time - small.event_time) < '2 hours'
ORDER BY login_sar_vessel.position <-> login_pos_report.position
LIMIT 1

Is there a way to possible first select data from big table for 2 hours around the timestamp associated with the group_id, find the big rows with the smallest (big.position to small.position) distance and then repeat for each group_id? That seems a bit messy.

Oh, and the DB is postgres 9.6 with postgis 2.4.

Best Answer

I think you want something like this,

SELECT
  big.id,
  small.id,
  ST_Distance(big.geom, small.geom),
  big.event_time <-> small.event_time
FROM small
CROSS JOIN LATERAL (
  SELECT *
  FROM big
  WHERE (small.event_time <-> big.event_time) < '2 hours'::interval
  ORDER BY small.geom <-> big.geom
  OFFSET 0
  LIMIT 1
) AS big;

You can put both of these on both timestmap and your geom for extra joy using btree_gist

CREATE EXTENSION btree_gist;

CREATE INDEX ON big USING gist (event_time, geom);
CREATE INDEX ON small USING gist (event_time, geom);
VACUUM FULL ANALYZE big;
VACUUM FULL ANALYZE small;

Note: I would upgrade to PostgreSQL 10 so you'd have parallel stuff, as it'll make a major difference for this work load.

You may also consider BRIN indexes on the timestamp, and clustering by timestamp. And after that's done going to GIS. Or even creating a new indexed MATERIALIZED VIEW with only the last week of data or whatever.

Related Solutions

Postgresql – Problem with shortest distance calculation in PostGIS

Maybe this will help start you out but here is how I am thinking through this.

I can't think of any case, re spherical trig where two straight lines would be closer in the middle than at the edges, unless they cross so I think you can narrow your search down to the vertices of each road plus the nearest point to that vertex on the other road. I am assuming that your roads are modelled as line segments along great circles.

So I don't think that this is a complete solution but, what I would be looking at doing is:

Do these lines intersect? If you assume that all intersections are listed as such in your db, you can assume no.
If so that's the closest point.
If not, then grab the vertices on both roads, draw any lines perpendicular to another line segment on the other, and calculate distance to intersection. Your shortest point will be the minimum of these lines.

That seems the best answer. You have to calculate points. I don;t know of any magic way to do this in PostGIS but it is something where spherical trig may come to the rescue. If the distances are sufficiently small, plane trig may be sufficient as an approximation, of course.

How and what is the most efficient way to join two tables, retaining a particular field from both

It's not entirely clear what you are after but I think a FULL outer join would help:

SELECT 
    COALESCE(a.Component, b.Component) AS Component
  , COALESCE(a.data1, 0) AS data1
  , COALESCE(a.data2, 0) AS data2
  , COALESCE(b.data3, 0) AS data3
  , COALESCE(b.data4, 0) AS data4
FROM 
    table1 AS a
  FULL JOIN
    table2 AS b
      ON b.Component = a.Component ;

If Component is not UNIQUE on one (or both) of the tables, then you could aggreagte and then join:

WITH a AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data1) AS sum_data1
      , SUM(data2) AS sum_data2
      -- ...
      , AVG(data1) AS avg_data1
      -- ...
    GROUP BY
        Component
    FROM
        table1
  )
  , b AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data3) AS sum_data3
      -- ...
      , AVG(data3) AS avg_data3
      -- ...
    GROUP BY
        Component
    FROM
        table2
  ) 
    SELECT 
        COALESCE(a.Component, b.Component) AS Component
      , COALESCE(a.sum_data1,0) AS sum_data1
      -- ...
      , COALESCE(b.sum_data3,0) AS sum_data3
      -- ...
    FROM 
        a FULL JOIN b
            ON b.Component = a.Component ;

Best Answer

Related Solutions

Postgresql – Problem with shortest distance calculation in PostGIS

How and what is the most efficient way to join two tables, retaining a particular field from both

Related Question