PostgreSQL – How to Find Closest Match in the Same Table

postgresqlpostgresql-9.4timestamp

I have a table with timestamped data. For all the rows, I want to find the row that is the closest to 20 minutes after and between 15 and 25 minutes after.

e.g. if the table is like

CREATE TABLE foo(id,ts)
AS VALUES 
  ( 1::int, '2017.10.27T10:12:15'::timestamp with time zone ),
  ( 2,      '2017.10.27T10:24:17' ),
  ( 3,      '2017.10.27T10:30:22' ),
  ( 4,      '2017.10.27T10:40:12' ),
  ( 5,      '2017.10.27T10:52:16' ),
  ( 6,      '2017.10.27T10:53:11' );

I then run a query

 select t1.id as base t2.id as after
 from table t1, table t2 where ?? 

I would like to get the answer:

 base after
 1    3
 2    4
 3    5

For 4,5 and 6 as base I get no result as there are no rows matching with my criteria

It would be easy enough to get this for a given timestamp:

select id from table where timestamp > $mytimestamp+'00:15:00'
  and  timestamp < $mytimestamp + '00:25:00' 
order by 
   abs(extract ( epoch from (timestamp -($mytimestamp + '00:20:00'))) 
limit 1;

But how to do this for all the rows in a table?

Best Answer

You can use a CROSS JOIN LATERAL ( ... LIMIT 1 ) to accomplish this, along with btree_gist and <->

CREATE EXTENSION IF NOT EXISTS btree_gist;

SELECT f1.id AS base, f2.id AS after
FROM foo AS f1
CROSS JOIN LATERAL (
  SELECT id, ts
  FROM foo AS f2
  WHERE f2.ts BETWEEN f1.ts+'15 minutes' AND f1.ts+'25 minutes'
  ORDER BY f1.ts+'20 minutes' <-> f2.ts ASC
  LIMIT 1
) AS f2(id,ts);

 base | after 
------+-------
    1 |     3
    2 |     4
    3 |     5
(3 rows)

There are a lot of tricks going on here. We use <-> as a distance operator for your "closest to". Essentially, it can use a GiST index to determine this with KNN. We also use a CROSS JOIN LATERAL ... LIMIT 1 to get the list of all potential roles, and then limit it to just the one that is "closest to" as determined by our knn operator above.

This will work fastest with a gist index on ts.

CREATE INDEX ON foo USING gist ( ts );

See also another question using <-> on a time stamp