PostgreSQL Self-Join – How to Create Unique Pairs

join; plpgsql; postgresql; recursive

I have a table (players) containing a list of players. I would like to pair up these players so that every player gets an opponent they have not faced before in every round (i.e. every time the SELECT query is called).

The player_opponent_log column is an integer[] which contains the player_ids of players who have played with that player in a previous round (and is used to help pick out unique players). This column is populated afterwards using the results of the SELECT query – and is outside the scope of this question.

For instance, the table would contain the following data:

 player_id | player_opponent_log 
-----------+---------------------
         1 | {2,3}
         2 | {1}
         3 | {1}
         4 | {}
         5 | {}
         6 | {}
         7 | {8}
         8 | {7}

What I am trying to achieve would be something along the following lines:

 player_id1 | player_id2 
------------+------------
          1 |          4
          2 |          3
          5 |          7
          6 |          8

I have tried countless different approaches, including GROUP BY, DISTINCT ON (), and self-joins, but haven't been able to arrive at a working solution.

The following result is what I am currently getting and am trying to avoid:

 player_id1 | player_id2 
------------+------------
          1 |          4
          2 |          3
          3 |          4
          4 |          1
          5 |          1
          6 |          1
          7 |          1
          8 |          1

Where I'm getting stuck is how to prevent a single player from being allocated more than once per round (i.e. per SELECT query).

Any ideas on how to solve this would be highly appreciated.

Best Answer

Every row of the result depends on the previous row. A recursive CTE comes to mind, and I tried that. But one would need to refer to the worktable in an OUTER JOIN or a subquery expression, neither of which is allowed. This does not work (building on the table layout in my fiddle):

WITH RECURSIVE
   t0 AS (SELECT *, COALESCE(array_length(opp_log,1), 0) AS len FROM tbl)
,  t1 AS (                                     -- candidate pairs
   SELECT t1.player_id AS p1, t2.player_id AS p2
         ,t1.len AS len1, t2.len AS len2
   FROM   t0 t1, t0 t2 
   WHERE  t2.player_id <> t1.player_id
   AND    t2.player_id <> ALL (t1.opp_log)
   )
, cte AS (
   (
   SELECT p1, p2
   FROM   t1
   ORDER  BY len1 DESC, len2 DESC
   LIMIT  1
   )

   UNION ALL
   (
   SELECT t1.p1, t1.p2
   FROM   t1
   LEFT   JOIN cte c ON t1.p1 IN (c.p1, c.p2)  -- recursive reference in an outer join: rejected
                        OR t1.p2 IN (c.p1, c.p2)
   WHERE  c.p1 IS NULL
   ORDER  BY len1 DESC, len2 DESC
   LIMIT  1
   )
   )
SELECT *
FROM   cte;

> ERROR:  recursive reference to query "cte" must not appear within an outer join

I don't think there is a halfway decent way to solve this with pure SQL. I suggest:

Procedural solution with PL/pgSQL

CREATE OR REPLACE FUNCTION f_next_round()
  RETURNS TABLE (player_id1 int, player_id2 int) AS
$func$
DECLARE
   rows int := (SELECT count(*)/2 FROM tbl);  -- expected number of resulting rows
   ct   int := 0;                             -- running count
BEGIN

CREATE TEMP TABLE t ON COMMIT DROP AS         -- possible combinations
SELECT t1.player_id AS p1, t2.player_id AS p2
     , COALESCE(array_length(t1.opp_log,1), 0) AS len1
     , COALESCE(array_length(t2.opp_log,1), 0) AS len2
FROM   tbl t1, tbl t2 
WHERE  t2.player_id <> t1.player_id
AND    t2.player_id <> ALL (t1.opp_log)
AND    t1.player_id <> ALL (t2.opp_log)
ORDER  BY len1 DESC, len2 DESC;               -- opportune sort order

LOOP
   SELECT INTO player_id1, player_id2  p1, p2 FROM t LIMIT 1;

   EXIT WHEN NOT FOUND;
   RETURN NEXT;
   ct := ct + 1;                              -- running count

   DELETE FROM t                              -- remove obsolete pairs
   WHERE  p1 IN (player_id1, player_id2) OR 
          p2 IN (player_id1, player_id2);
END LOOP;

IF ct < rows THEN
   RAISE EXCEPTION 'Could not find a solution';
ELSIF ct > rows THEN
   RAISE EXCEPTION 'Impossible result!';
END IF;

END
$func$  LANGUAGE plpgsql VOLATILE;

How?

Build a temporary table with remaining possible pairs. This kind of cross join produces a lot of rows with big tables, but since we seem to be talking about tournaments, numbers should be reasonably low.

Players with the longest list of opponents are sorted first. This way, players that would be hard to match come first, increasing the chance for a solution.

Pick the first row and delete related pairings that are now obsolete. No need to sort again. Logically, any row is good; in practice, we get the player with the longest list of opponents first due to the initial sort (which is not reliable without ORDER BY, but good enough for the case).

Repeat until no match is left.
Keep count and raise an exception if the count is not as expected. PL/pgSQL conveniently allows raising an exception after the fact, which cancels any rows returned before it. Details in the manual.
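The last point can be seen in isolation with a throwaway function. This is only a sketch: f_demo_cancel is a made-up name for illustration and is not part of the solution above.

```sql
-- Hypothetical demo function: queues two rows, then raises.
CREATE OR REPLACE FUNCTION f_demo_cancel()
  RETURNS SETOF int AS
$func$
BEGIN
   RETURN NEXT 1;   -- queued in the result set, not yet delivered
   RETURN NEXT 2;
   RAISE EXCEPTION 'raised after two rows were queued';
END
$func$  LANGUAGE plpgsql;

-- The call fails with the exception; the two queued rows never reach the caller.
SELECT * FROM f_demo_cancel();
```

This works because RETURN NEXT only appends to the function's result set, which is handed to the caller when the function finishes normally.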

Call:

SELECT * FROM f_next_round();

Result:

player_id1 | player_id2
-----------+-----------
1          | 7
2          | 3
4          | 8
5          | 6

SQL Fiddle.
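In case the fiddle is not at hand, a table matching the layout the function expects could be set up roughly like this. The names tbl, player_id and opp_log are taken from the function above and the rows from the question; the column types, the primary key and the default are assumptions.

```sql
-- Assumed layout: one row per player, opp_log holds previous opponents.
CREATE TABLE tbl (
   player_id int PRIMARY KEY
 , opp_log   int[] NOT NULL DEFAULT '{}'
);

-- Sample rows from the question.
INSERT INTO tbl (player_id, opp_log) VALUES
   (1, '{2,3}')
 , (2, '{1}')
 , (3, '{1}')
 , (4, '{}')
 , (5, '{}')
 , (6, '{}')
 , (7, '{8}')
 , (8, '{7}');
```

Since the function picks rows without ORDER BY, the exact pairs returned for this data may differ between runs.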

Note

This does not guarantee to calculate the perfect solution. It just returns a possible solution and uses some limited smarts to improve the chances of finding one. The problem is a bit like solving a Sudoku, really, and is not trivially solved perfectly.

Related Solutions

SQL query to count number of payments made in a time period with data from two different tables

Here is a query that should give you what you are looking for. This is Oracle SQL syntax, but may work in other databases as well.

SELECT dau.datetime, count(*) FROM DAU
WHERE EXISTS (select 1 from Payer 
   WHERE Payer.user_id=DAU.user_id 
   AND Payer.payment_timestamp BETWEEN trunc(dau.datetime)-90 AND trunc(DAU.datetime))
GROUP BY dau.datetime
ORDER BY DateTime;

Data to demonstrate:

create table Payer as (select 1001 user_id, to_date('06/01/2015 07:00','MM/DD/YYYY HH24:MI') payment_timestamp from dual);
insert into Payer values (1001, to_date('06/01/2015 07:30','MM/DD/YYYY HH24:MI'));
insert into Payer values (1002, to_date('06/01/2015 10:00','MM/DD/YYYY HH24:MI'));
insert into Payer values (1001, to_date('06/02/2015 13:50','MM/DD/YYYY HH24:MI'));
insert into Payer values (1002, to_date('06/02/2015 08:00','MM/DD/YYYY HH24:MI'));
insert into Payer values (1004, to_date('06/02/2015 09:00','MM/DD/YYYY HH24:MI'));
insert into Payer values (1003, to_date('06/03/2015 10:00','MM/DD/YYYY HH24:MI'));
insert into Payer values (1005, to_date('06/03/2015 08:30','MM/DD/YYYY HH24:MI'));

create table DAU as (select 1001 user_id, to_date('06/01/2015','MM/DD/YYYY') datetime from dual);
insert into DAU values (1002, to_date('06/01/2015','MM/DD/YYYY'));
insert into DAU values (1003, to_date('06/01/2015','MM/DD/YYYY'));
insert into DAU values (1001, to_date('06/02/2015','MM/DD/YYYY'));
insert into DAU values (1002, to_date('06/02/2015','MM/DD/YYYY'));
insert into DAU values (1004, to_date('06/02/2015','MM/DD/YYYY'));
insert into DAU values (1002, to_date('06/03/2015','MM/DD/YYYY'));
insert into DAU values (1003, to_date('06/03/2015','MM/DD/YYYY'));
insert into DAU values (1005, to_date('06/03/2015','MM/DD/YYYY'));
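For PostgreSQL, the query above might be ported along these lines. This is a sketch under assumptions: both columns are taken to be of type timestamp, date_trunc('day', ...) stands in for Oracle's trunc(), and the inline rows are a small subset of the demo data so the statement runs on its own.

```sql
WITH dau (user_id, datetime) AS (
   VALUES (1001, TIMESTAMP '2015-06-01 00:00')
        , (1001, TIMESTAMP '2015-06-02 00:00')
        , (1002, TIMESTAMP '2015-06-02 00:00')
   )
, payer (user_id, payment_timestamp) AS (
   VALUES (1001, TIMESTAMP '2015-06-01 07:00')
        , (1002, TIMESTAMP '2015-06-02 08:00')
   )
SELECT d.datetime, count(*) AS payers
FROM   dau d
WHERE  EXISTS (
   SELECT 1
   FROM   payer p
   WHERE  p.user_id = d.user_id
   -- same window as the Oracle version: 90 days back up to midnight of the DAU day
   AND    p.payment_timestamp BETWEEN date_trunc('day', d.datetime) - interval '90 days'
                                  AND date_trunc('day', d.datetime)
   )
GROUP  BY d.datetime
ORDER  BY d.datetime;
```

As in the original, the upper bound is midnight at the start of the DAU day, so payments made later on that same day do not count for it.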

PostgreSQL – How to Limit an Inner Query by Outer Query Attribute

First, if you SELECT only from the teams table and there is a FOREIGN KEY constraint that REFERENCES players, you don't need the join to players at all.

Now, there are various ways to write this type of query; there is even a tag for it on SO and DBA.SE: greatest-n-per-group. We need the greatest 1 in this case. The simplest code uses the DISTINCT ON construct (a non-standard Postgres addition to SQL).
If you need columns from players, you can simply join and add the columns to the select list:

SELECT DISTINCT ON (t.player_id)
  t.player_id,
  t.team_name,
  t.start_date,
  t.end_date
FROM teams AS t
  -- JOIN players AS p            
  --   ON p.id = t.player_id
ORDER BY 
    t.player_id,  -- this needs to match the DISTINCT ON ()
    ....      -- pick what order you want, so the 1st is chosen for each player
 ;            -- no LIMIT
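To make the pattern concrete, here is a self-contained sketch with inline sample rows. The data is invented, and ordering by start_date DESC (to pick each player's most recent team) is only an assumed choice for the ordering left open above.

```sql
WITH teams (player_id, team_name, start_date, end_date) AS (
   VALUES (1, 'Reds',   DATE '2010-01-01', DATE '2012-06-30')
        , (1, 'Blues',  DATE '2012-07-01', DATE '2015-06-30')
        , (2, 'Greens', DATE '2011-03-01', DATE '2014-02-28')
   )
SELECT DISTINCT ON (t.player_id)
       t.player_id, t.team_name, t.start_date, t.end_date
FROM   teams AS t
ORDER  BY t.player_id            -- must lead with the DISTINCT ON expression
        , t.start_date DESC;     -- assumed criterion: latest team first
```

With these rows, player 1 comes back with 'Blues' and player 2 with 'Greens'.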

Another way, which also works in most other DBMS, is to use window functions. The criteria for choosing the row go inside the OVER clause:

SELECT 
    g.player_id,
    g.team_name,
    g.start_date,
    g.end_date
FROM
  ( SELECT 
      t.player_id,
      t.team_name,
      t.start_date,
      t.end_date,
      ROW_NUMBER() OVER (PARTITION BY t.player_id 
                         ORDER BY ...) AS rn
    FROM teams AS t
      -- JOIN players AS p            
      --   ON p.id = t.player_id
  ) AS g
WHERE
    g.rn = 1 ;
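A runnable sketch of the window-function variant, again with invented inline rows and an assumed ordering by start_date DESC. The extra team_name sort key is an assumption added as a tie-breaker: without one, row_number() picks arbitrarily among rows with the same start_date.

```sql
WITH teams (player_id, team_name, start_date, end_date) AS (
   VALUES (1, 'Reds',   DATE '2010-01-01', DATE '2012-06-30')
        , (1, 'Blues',  DATE '2012-07-01', DATE '2015-06-30')
        , (2, 'Greens', DATE '2011-03-01', DATE '2014-02-28')
   )
SELECT g.player_id, g.team_name, g.start_date, g.end_date
FROM  (
   SELECT t.*
        , row_number() OVER (PARTITION BY t.player_id
                             ORDER BY t.start_date DESC, t.team_name) AS rn
   FROM   teams AS t
   ) AS g
WHERE  g.rn = 1;                 -- keep only the top-ranked row per player
```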

If you are not on some ancient version of Postgres, there is also the LATERAL syntax, which is often the most efficient. That is especially true when there is a "driving" table holding the distinct values we want to group by (players here), and when these values are relatively few and each has many candidate rows (i.e. in this case, the bigger the teams table is in comparison to players, the better for this query).
Also notice how closely this query resembles your initial idea. It does literally what you want: it allows the players columns to be referenced inside the lt subquery:

SELECT
    lt.player_id,
    lt.team_name,
    lt.start_date,
    lt.end_date
FROM players AS p
  JOIN LATERAL
    ( SELECT t.*
      FROM teams AS t 
      WHERE p.id = t.player_id
      ORDER BY ....     -- pick the order
      LIMIT 1 
   ) AS lt ON TRUE ;
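A self-contained sketch of the LATERAL form, with invented rows and the same assumed ordering (latest start_date first). It uses LEFT JOIN LATERAL instead of the plain JOIN above, a deliberate variation to show that players without any team are then kept, with NULLs in the team columns.

```sql
WITH players (id) AS (
   VALUES (1), (2), (3)          -- player 3 has no team
   )
, teams (player_id, team_name, start_date, end_date) AS (
   VALUES (1, 'Reds',   DATE '2010-01-01', DATE '2012-06-30')
        , (1, 'Blues',  DATE '2012-07-01', DATE '2015-06-30')
        , (2, 'Greens', DATE '2011-03-01', DATE '2014-02-28')
   )
SELECT p.id AS player_id, lt.team_name, lt.start_date, lt.end_date
FROM   players AS p
LEFT   JOIN LATERAL (
   SELECT t.*
   FROM   teams AS t
   WHERE  t.player_id = p.id     -- reference to the outer table, enabled by LATERAL
   ORDER  BY t.start_date DESC   -- assumed criterion: latest team first
   LIMIT  1
   ) AS lt ON true;
```

With a plain JOIN LATERAL ... ON TRUE, as in the query above, player 3 would simply drop out of the result.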