Postgresql – How to optimize selection of pairs from one column of the table (self-join)

join;optimizationpostgresqlself-join

I'm using PostgreSQL 9.5.19, DBeaver 6.3.4

I have a table where one row is – user's name, place he attended, time when he was there

I need to select all pairs of places where any user was (if user was at place a and place b i need row like this: user, place a, place b, time at place a, time at place b)

The ponds table:

CREATE TABLE example.example (
    tm timestamp NOT NULL,
    place_name varchar NOT NULL,
    user_name varchar NOT NULL
);

Some sample data:

INSERT INTO example.example (tm, place_name, user_name)
values
('2020-02-25 00:00:19.000', 'place_1', 'user_1'),
('2020-03-25 00:00:19.000', 'place_2', 'user_1'),
('2020-02-25 00:00:19.000', 'place_1', 'user_2'),
('2020-03-25 00:00:19.000', 'place_1', 'user_3'),
('2020-02-25 00:00:19.000', 'place_2', 'user_3');

I'm trying this script:

select 
   t.user_name    
  ,t.place_name as r1_place
  ,max(t.tm) as r1_tm
  ,t2.place_name as r2_place
  ,min(t2.tm) as r2_tm
from example.example as t
join example.example as t2 on t.user_name = t2.user_name 
                       and t.tm < t2.tm 
                       and t.place_name <> t2.place_name
where t.tm between '2020-02-25 00:00:00' and '2020-03-25 15:00:00' 
  and t2.tm between '2020-02-25 00:00:00' and '2020-03-25 15:00:00'
    group by t.user_name
       , t.place_name
       , t2.place_name

Seems like it gives me the right result, but it works really slow.
Can I optimize it somehow?

Best Answer

Postgresql 9.5.19 has windowing functions that prove helpful in such situation. The lead() function give you access to the next row in a "partition".

You could try something like that :

SELECT
  user_name,
  place_name AS r1_place,
  tm AS r1_tm,
  lead(place_name) OVER (PARTITION BY user_name ORDER BY tm) AS r2_place,
  lead(tm) OVER (PARTITION BY user_name ORDER BY tm) AS r2_tm
FROM example
ORDER BY 1, 3;

resulting in :

user_name|r1_place|r1_tm              |r2_place|r2_tm              |
---------|--------|-------------------|--------|-------------------|
user_1   |place_1 |2020-02-25 00:00:19|place_2 |2020-03-25 00:00:19|
user_1   |place_2 |2020-03-25 00:00:19|        |                   |
user_2   |place_1 |2020-02-25 00:00:19|        |                   |
user_3   |place_2 |2020-02-25 00:00:19|place_1 |2020-03-25 00:00:19|
user_3   |place_1 |2020-03-25 00:00:19|        |                   |

Not sure about the performance part however... you should make some tests.

Of course, you can filter out null results:

SELECT * FROM (
  SELECT
    user_name,
    place_name AS r1_place,
    tm AS r1_tm,
    lead(place_name) OVER (PARTITION BY user_name ORDER BY tm) AS r2_place,
    lead(tm) OVER (PARTITION BY user_name ORDER BY tm) AS r2_tm
  FROM example
  ORDER BY 1, 3) req
WHERE r2_place IS NOT null