Postgresql – How to optimize selection of pairs from one column of the table (self-join)

join; optimization; postgresql; self-join

I'm using PostgreSQL 9.5.19, DBeaver 6.3.4

I have a table where each row holds a user's name, a place the user visited, and the time of the visit.

I need to select all pairs of places that any user visited. If a user was at place a and place b, I need a row like this: user, place a, place b, time at place a, time at place b.

The table:

CREATE TABLE example.example (
    tm timestamp NOT NULL,
    place_name varchar NOT NULL,
    user_name varchar NOT NULL
);

Some sample data:

INSERT INTO example.example (tm, place_name, user_name)
values
('2020-02-25 00:00:19.000', 'place_1', 'user_1'),
('2020-03-25 00:00:19.000', 'place_2', 'user_1'),
('2020-02-25 00:00:19.000', 'place_1', 'user_2'),
('2020-03-25 00:00:19.000', 'place_1', 'user_3'),
('2020-02-25 00:00:19.000', 'place_2', 'user_3');

I'm trying this script:

select 
   t.user_name    
  ,t.place_name as r1_place
  ,max(t.tm) as r1_tm
  ,t2.place_name as r2_place
  ,min(t2.tm) as r2_tm
from example.example as t
join example.example as t2 on t.user_name = t2.user_name 
                       and t.tm < t2.tm 
                       and t.place_name <> t2.place_name
where t.tm between '2020-02-25 00:00:00' and '2020-03-25 15:00:00' 
  and t2.tm between '2020-02-25 00:00:00' and '2020-03-25 15:00:00'
    group by t.user_name
       , t.place_name
       , t2.place_name

It seems to give me the right result, but it runs really slowly.
Can I optimize it somehow?

Best Answer

PostgreSQL 9.5.19 has window functions, which are helpful in this situation. The lead() function gives you access to the next row within a "partition", so each visit can be paired with the same user's next visit.

You could try something like this:

SELECT
  user_name,
  place_name AS r1_place,
  tm AS r1_tm,
  lead(place_name) OVER (PARTITION BY user_name ORDER BY tm) AS r2_place,
  lead(tm) OVER (PARTITION BY user_name ORDER BY tm) AS r2_tm
FROM example.example
ORDER BY 1, 3;

resulting in:

user_name|r1_place|r1_tm              |r2_place|r2_tm              |
---------|--------|-------------------|--------|-------------------|
user_1   |place_1 |2020-02-25 00:00:19|place_2 |2020-03-25 00:00:19|
user_1   |place_2 |2020-03-25 00:00:19|        |                   |
user_2   |place_1 |2020-02-25 00:00:19|        |                   |
user_3   |place_2 |2020-02-25 00:00:19|place_1 |2020-03-25 00:00:19|
user_3   |place_1 |2020-03-25 00:00:19|        |                   |

I have not measured the performance, though, so you should test it against your own data.
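One way to test this is to look at the actual execution plan with EXPLAIN. The sketch below assumes the table from the question. The index and its name (example_user_tm_idx) are made up here for illustration, and whether the planner uses it depends on data volume and statistics.

```sql
-- Hypothetical index (name chosen for illustration). It matches the
-- PARTITION BY user_name ORDER BY tm of the window, so the planner may
-- be able to read rows pre-sorted instead of sorting them itself.
CREATE INDEX example_user_tm_idx
    ON example.example (user_name, tm);

-- Refresh planner statistics after creating the index.
ANALYZE example.example;

-- Show the actual plan, timings and buffer usage of the window query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  user_name,
  place_name AS r1_place,
  tm AS r1_tm,
  lead(place_name) OVER (PARTITION BY user_name ORDER BY tm) AS r2_place,
  lead(tm) OVER (PARTITION BY user_name ORDER BY tm) AS r2_tm
FROM example.example;
```

Running the same EXPLAIN (ANALYZE, BUFFERS) on the original self-join gives a baseline to compare against.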

Of course, you can filter out null results:

SELECT * FROM (
  SELECT
    user_name,
    place_name AS r1_place,
    tm AS r1_tm,
    lead(place_name) OVER (PARTITION BY user_name ORDER BY tm) AS r2_place,
    lead(tm) OVER (PARTITION BY user_name ORDER BY tm) AS r2_tm
  FROM example.example) req
WHERE r2_place IS NOT NULL
ORDER BY 1, 3;
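The question also restricts the time range and only wants pairs of different places. Here is a minimal sketch of how those two conditions could be folded into the window query, assuming the table and sample data from the question. Note that, unlike the self-join, lead() pairs each visit only with the user's next visit, not with every later one.

```sql
SELECT *
FROM (
  SELECT
    user_name,
    place_name AS r1_place,
    tm AS r1_tm,
    lead(place_name) OVER w AS r2_place,
    lead(tm) OVER w AS r2_tm
  FROM example.example
  -- Time range taken from the question; applied before the window is computed.
  WHERE tm BETWEEN '2020-02-25 00:00:00' AND '2020-03-25 15:00:00'
  WINDOW w AS (PARTITION BY user_name ORDER BY tm)
) req
-- Drops same-place pairs. Rows without a next visit have a NULL r2_place,
-- so the comparison is not true for them and they are dropped as well.
WHERE r2_place <> r1_place
ORDER BY user_name, r1_tm;
```

With the sample data from the question this should return one row each for user_1 and user_3.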

Related Solutions

Mysql – Query to compare two subsets of data from the same table

I think you have to hack it a bit with a derived table, AKA an implicit temporary table, AKA a "subquery in the from clause."

We derive a table we'll call `t` containing each distinct (server,software) from gocore, then left join to gocore twice, once on tag = 'old' and once on tag = 'new'.

SELECT t.server, t.software, o.revision AS old_rev, n.revision AS new_rev
  FROM (SELECT DISTINCT server, software FROM gocore) t
  LEFT JOIN gocore o ON o.server = t.server AND o.software = t.software AND o.tag = 'old'
  LEFT JOIN gocore n ON n.server = t.server AND n.software = t.software AND n.tag = 'new';
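To see what the derived table buys you, here is a small self-contained sketch. The column types and the sample rows are assumptions made up for illustration. Only the table and column names come from the query above.

```sql
-- Hypothetical layout and data; the column types are guesses.
CREATE TABLE gocore (
  server   varchar(50) NOT NULL,
  software varchar(50) NOT NULL,
  revision varchar(20) NOT NULL,
  tag      varchar(10) NOT NULL   -- 'old' or 'new'
);

INSERT INTO gocore (server, software, revision, tag) VALUES
  ('srv1', 'app', '1.0', 'old'),
  ('srv1', 'app', '1.1', 'new'),
  ('srv2', 'app', '1.1', 'new');   -- no 'old' row for srv2

-- Same query as above, with an ORDER BY added for a stable result order.
SELECT t.server, t.software, o.revision AS old_rev, n.revision AS new_rev
  FROM (SELECT DISTINCT server, software FROM gocore) t
  LEFT JOIN gocore o ON o.server = t.server AND o.software = t.software AND o.tag = 'old'
  LEFT JOIN gocore n ON n.server = t.server AND n.software = t.software AND n.tag = 'new'
 ORDER BY t.server, t.software;
```

Here srv1/app comes back with both revisions, while srv2/app keeps its row with old_rev set to NULL. That is the reason for left joining against the distinct list instead of inner joining the two tagged subsets directly.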

PostgreSQL Self-Join – How to Create Unique Pairs

Every row of the result depends on the previous row. A recursive CTE comes to mind, and I tried that, but one would need to refer to the worktable in an OUTER JOIN or a subquery expression, which is not allowed. This does not work (building on the table layout in my fiddle):

WITH RECURSIVE
   t0 AS (SELECT *, COALESCE(array_length(opp_log,1), 0) AS len FROM tbl)
,  t1 AS (
   SELECT t1.player_id AS p1, t2.player_id AS p2
         ,t1.len AS len1, t2.len AS len2
   FROM   t0 t1, t0 t2 
   WHERE  t2.player_id <> t1.player_id
   AND    t2.player_id <> ALL (t1.opp_log)
   )
, cte AS (
   (
   SELECT p1, p2
   FROM   t1
   ORDER  BY len1 DESC, len2 DESC
   LIMIT  1
   )

   UNION ALL
   (
   SELECT p1, p2
   FROM   t1
   LEFT   JOIN cte c ON t1.p1 IN (c.p1, c.p2)
                        OR t1.p2 IN (c.p1, c.p2)
   WHERE  c.p1 IS NULL
   ORDER  BY len1 DESC, len2 DESC
   LIMIT  1
   )
   )
SELECT *
FROM   cte;

> ERROR:  recursive reference to query "cte" must not appear within an outer join

I don't think there is a halfway decent way to solve this with pure SQL. I suggest:

Procedural solution with PL/pgSQL

CREATE OR REPLACE FUNCTION f_next_round()
  RETURNS TABLE (player_id1 int, player_id2 int) AS
$func$
DECLARE
   rows int := (SELECT count(*)/2 FROM tbl);  -- expected number of resulting rows
   ct   int := 0;                             -- running count
BEGIN

CREATE TEMP TABLE t ON COMMIT DROP AS         -- possible combinations
SELECT t1.player_id AS p1, t2.player_id AS p2
     , COALESCE(array_length(t1.opp_log,1), 0) AS len1
     , COALESCE(array_length(t2.opp_log,1), 0) AS len2
FROM   tbl t1, tbl t2 
WHERE  t2.player_id <> t1.player_id
AND    t2.player_id <> ALL (t1.opp_log)
AND    t1.player_id <> ALL (t2.opp_log)
ORDER  BY len1 DESC, len2 DESC;               -- opportune sort order

LOOP
   SELECT INTO player_id1, player_id2  p1, p2 FROM t LIMIT 1;

   EXIT WHEN NOT FOUND;
   RETURN NEXT;
   ct := ct + 1;                              -- running count

   DELETE FROM t                              -- remove obsolete pairs
   WHERE  p1 IN (player_id1, player_id2) OR 
          p2 IN (player_id1, player_id2);
END LOOP;

IF ct < rows THEN
   RAISE EXCEPTION 'Could not find a solution';
ELSIF ct > rows THEN
   RAISE EXCEPTION 'Impossible result!';
END IF;

END
$func$  LANGUAGE plpgsql VOLATILE;

How?

Build a temporary table with remaining possible pairs. This kind of cross join produces a lot of rows with big tables, but since we seem to be talking about tournaments, numbers should be reasonably low.

Players with the longest list of opponents are sorted first. This way, players that would be hard to match come first, increasing the chance for a solution.

Pick the first row and delete the related pairings that are now obsolete. There is no need to sort again. Logically any row is good; in practice we get the player with the longest list of opponents first due to the initial sort (which is not reliable without ORDER BY, but good enough for this case).

Repeat until no match is left.
Keep count and raise an exception if the count is not as expected. PL/pgSQL conveniently allows raising an exception after the fact, which cancels any previous return values. Details in the manual.
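To try the function you need a table shaped like the one it reads from. The fiddle's data is not reproduced here, so the following is only a sketch: the column types are inferred from how the function uses them, and the rows are invented.

```sql
-- Assumed layout: opp_log holds the ids of opponents already played.
-- It must not be NULL: "x <> ALL (NULL)" yields NULL, which would
-- silently exclude that player from every candidate pair.
CREATE TABLE tbl (
  player_id int PRIMARY KEY,
  opp_log   int[] NOT NULL DEFAULT '{}'
);

-- Invented data: four players after one round (1 played 2, 3 played 4).
INSERT INTO tbl (player_id, opp_log) VALUES
  (1, '{2}'),
  (2, '{1}'),
  (3, '{4}'),
  (4, '{3}');

-- Candidate pairs the function would start from (same filter as its
-- temp table). Each pair appears in both directions.
SELECT t1.player_id AS p1, t2.player_id AS p2
FROM   tbl t1, tbl t2
WHERE  t2.player_id <> t1.player_id
AND    t2.player_id <> ALL (t1.opp_log)
AND    t1.player_id <> ALL (t2.opp_log)
ORDER  BY p1, p2;
```

For this invented data the candidate list has eight rows, since every player can still meet the two players from the other first-round match.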

Call:

SELECT * FROM f_next_round();

Result:

player_id1 | player_id2
-----------+-----------
1          | 7
2          | 3
4          | 8
5          | 6

SQL Fiddle.

Note

This is not guaranteed to calculate the perfect solution. It just returns a possible solution and uses some limited smarts to improve the chances of finding one. The problem is a bit like solving a Sudoku, really, and is not trivially solved perfectly.