PostgreSQL – Recursive Lookup Query Design

functionspostgresqlrecursiveset-returning-functions

In PostgreSQL 9.4, here is my table:

with r(pk, userid,partneruserid) as 
(values 
(1,1,2),
(2,1,3),
(3,1,6),
(4,2,5),
(5,2,6),
(6,3,1),
(7,4,1),
(8,5,3),
(9,6,2),
(10,6,3)
) select * from r;

In a table that stores pairs of user id, i.e. two integer columns (user id and partner), we are requested to find this kind of relations:

  • User m and user n in the same record
    (both cases userid = m && partneruserid = n and userid = n && partneruserid = m)
  • indirect partnership with a third X partner in common (in either two columns)
  • more indirect partnership with a third and a fourth X and Y partner in the chain which are partners between them: m .. x .. y .. n
  • So on and so forth until there are no more relationships or until the nest level is 5, meaning relation with five steps like:

    users m .. x .. y .. z .. a .. b .. n (always containing the cases in either two columns).

In this table, if we ask for the relationship between users 1 and 3, the function should return:

Approach 1

  • first, the obvious row 2: 1-3 (single record, no hops)
    Result set so far: [1,3]
  • and the not so obvious row 6: 3-1 (idem)
    Result set so far:

    [1,3]
    [3,1]
    
  • then if we follow the paths from user 1 (which are users 2, 3 and 6) and are looking for a partner one hop from user 1, we would then reach rows in which partners of user 1 (user 2 and user 6) are in the record as partners of user 3. These rows could have been [2, 3], [3, 2], [6, 3] and [3, 6] if they existed, but only row 10 does, being the result set so far:

    [1,3] [3,1] [6,3]
    
  • the next step would be search for rows that have user 2 and user 6 (first hop) partnered with partners of user 3 (1, 5 and 6) (this is the second hop), reaching: row 4 (1..row1..2..row5..6..row10..3), row 5 (1..row1..2..row4..5..row8..3) and row 9 (1..row1..2..row9..6..row10..3).
    Result set so far:

    [1,3]
    [3,1] 
    [6,3] 
    [2,5] 
    [2,6] 
    [6,2]
    
    • then with these last rows added to the result set, we would need to find all users that are partners of user 3 of user 5 that are partners of user 3. But as users 2 and 6 take us back to the first iteration, only remains user 5 for exploring his partnerships with a partner of user 3 (1, 5 and 6), being reached no more rows, hence no more hops or indirect partnerships can be found.
      The result set is still the same as the previous step.

    • here would be nice to have the quantity of partnerships are repeated allowing us to reduce the result set to:

      [1,3] qty: 2 
      [6,3] qty: 1 
      [2,5] qty: 1 
      [2,6] qty: 2
      

That would be the desired result. I hope not have made a mistake trying to make sense. As I posted previously, any advice that points me in the right direction, would be much appreciated!

Approach 2 (cannot imagine how)

  • Get the direct partners of user 1 (2, 3 and 6)
  • Get the direct partners of user 3 (1, 5 and 6)
  • Make a tree of each, and look for points in common.

Best Answer

Data model

First of all: fold duplicate entries. I don't see why you would allow reverse duplicates to begin with, so your setup becomes:

CREATE TEMP TABLE tbl(
     pk serial PRIMARY KEY
   , userid int NOT NULL
   , partneruserid int NOT NULL
   , ct int DEFAULT 1 NOT NULL  -- this part is unclear
   , CONSTRAINT user_partner_uni UNIQUE (userid, partneruserid)
   , CONSTRAINT userid_smaller   CHECK  (userid < partneruserid)
   );

INSERT INTO tbl
VALUES 
  (1,1,2,1)
, (2,1,3,2)  -- folded
, (3,1,6,1)
, (4,2,5,1)
, (5,2,6,2)  -- folded
, (7,1,4,1)
, (8,3,5,1)
, (10,3,6,1);

Recursive query

Building on the sanitized model (the constraint userid_smaller is not required for this query, though). This recursive query does basically everything you need:

WITH RECURSIVE  ids (a, z) AS (SELECT 1, 3)  -- enter values here
, cte AS (
   SELECT t.pk, CASE i.a WHEN t.userid THEN t.partneruserid ELSE t.userid END = i.z AS found
        , ARRAY[CASE i.a WHEN t.userid THEN t.partneruserid ELSE t.userid END, i.a] AS path
   FROM   tbl t, ids i
   WHERE  i.a IN (userid, partneruserid)   -- constraint user_id_smaller simplifies this task

   UNION ALL
   SELECT t.pk
        , CASE c.path[1] WHEN t.userid THEN t.partneruserid ELSE t.userid END = i.z AS found
        , CASE c.path[1] WHEN t.userid THEN t.partneruserid ELSE t.userid END || c.path
   FROM   cte c, tbl t, ids i
   WHERE  c.path[1] IN (t.userid, t.partneruserid)
   AND    CASE c.path[1] WHEN t.userid THEN t.partneruserid ELSE t.userid END <> ALL (c.path)
   AND    NOT c.found
   AND    array_length(path,1) < 4
   )
SELECT path, array_length(path,1) - 1 AS steps
FROM   cte c
WHERE  found;

SQL Fiddle

Except for the count. It's unclear how you count duplicate connections exactly. Strictly speaking, you should multiply ct with every step ...

Explanation

The adapted table forces userid to be the lower ID, so the initial predicate can be simplified: userid matches lower ID, partneruserid matches higher ID. If your input is parameterized, you can sort with LEAST (x,y) and GREATEST(x,y) Update: That shortcut only works for short-circuiting in a single step. We need to consider both IDs as starting point for longer paths. Query fixed.

I pick the lower input value (a) as start (could be the other way round) and walk the graph of partnerships until I reach the higher input value (z). Each step can connect to userid or partneruserid, which necessitates the CASE expressions: the match could be on either column, we need to continue with the other column respectively.

We have to remember the path so far (path) and rule out repetitions so not to walk in circles (... <> ALL (c.path)).

I save the last node of the path as first element (path[1]), just to avoid another column for that information. Either way may be more convenient / faster ...

Stop traversing the graph when z is reached (found) or when a given maximum length for the path has been exhausted (array_length(path,1) < 4).

Only rows where z is reached (WHERE found) are in the final result set. I only collect the bare minimum of information along the way. You can add more columns to the query or / and JOIN to the original table in the outer query to return more fields.