PostgreSQL – Recursive Lookup Query Design

functionspostgresqlrecursiveset-returning-functions

In PostgreSQL 9.4, here is my table:

with r(pk, userid,partneruserid) as 
(values 
(1,1,2),
(2,1,3),
(3,1,6),
(4,2,5),
(5,2,6),
(6,3,1),
(7,4,1),
(8,5,3),
(9,6,2),
(10,6,3)
) select * from r;

In a table that stores pairs of user id, i.e. two integer columns (user id and partner), we are requested to find this kind of relations:

User m and user n in the same record
(both cases userid = m && partneruserid = n and userid = n && partneruserid = m)
indirect partnership with a third X partner in common (in either two columns)
more indirect partnership with a third and a fourth X and Y partner in the chain which are partners between them: m .. x .. y .. n
So on and so forth until there are no more relationships or until the nest level is 5, meaning relation with five steps like:

users m .. x .. y .. z .. a .. b .. n (always containing the cases in either two columns).

In this table, if we ask for the relationship between users 1 and 3, the function should return:

Approach 1

first, the obvious row 2: 1-3 (single record, no hops)
Result set so far: [1,3]
and the not so obvious row 6: 3-1 (idem)
Result set so far:
```
[1,3]
[3,1]
```
then if we follow the paths from user 1 (which are users 2, 3 and 6) and are looking for a partner one hop from user 1, we would then reach rows in which partners of user 1 (user 2 and user 6) are in the record as partners of user 3. These rows could have been [2, 3], [3, 2], [6, 3] and [3, 6] if they existed, but only row 10 does, being the result set so far:
```
[1,3] [3,1] [6,3]
```
the next step would be search for rows that have user 2 and user 6 (first hop) partnered with partners of user 3 (1, 5 and 6) (this is the second hop), reaching: row 4 (1..row1..2..row5..6..row10..3), row 5 (1..row1..2..row4..5..row8..3) and row 9 (1..row1..2..row9..6..row10..3).
Result set so far:
```
[1,3]
[3,1] 
[6,3] 
[2,5] 
[2,6] 
[6,2]
```
- then with these last rows added to the result set, we would need to find all users that are partners of user 3 of user 5 that are partners of user 3. But as users 2 and 6 take us back to the first iteration, only remains user 5 for exploring his partnerships with a partner of user 3 (1, 5 and 6), being reached no more rows, hence no more hops or indirect partnerships can be found.
  The result set is still the same as the previous step.
- here would be nice to have the quantity of partnerships are repeated allowing us to reduce the result set to:
```
[1,3] qty: 2 
[6,3] qty: 1 
[2,5] qty: 1 
[2,6] qty: 2
```

That would be the desired result. I hope not have made a mistake trying to make sense. As I posted previously, any advice that points me in the right direction, would be much appreciated!

Approach 2 (cannot imagine how)

Get the direct partners of user 1 (2, 3 and 6)
Get the direct partners of user 3 (1, 5 and 6)
Make a tree of each, and look for points in common.

Best Answer

Data model

First of all: fold duplicate entries. I don't see why you would allow reverse duplicates to begin with, so your setup becomes:

CREATE TEMP TABLE tbl(
     pk serial PRIMARY KEY
   , userid int NOT NULL
   , partneruserid int NOT NULL
   , ct int DEFAULT 1 NOT NULL  -- this part is unclear
   , CONSTRAINT user_partner_uni UNIQUE (userid, partneruserid)
   , CONSTRAINT userid_smaller   CHECK  (userid < partneruserid)
   );

INSERT INTO tbl
VALUES 
  (1,1,2,1)
, (2,1,3,2)  -- folded
, (3,1,6,1)
, (4,2,5,1)
, (5,2,6,2)  -- folded
, (7,1,4,1)
, (8,3,5,1)
, (10,3,6,1);

Recursive query

Building on the sanitized model (the constraint userid_smaller is not required for this query, though). This recursive query does basically everything you need:

WITH RECURSIVE  ids (a, z) AS (SELECT 1, 3)  -- enter values here
, cte AS (
   SELECT t.pk, CASE i.a WHEN t.userid THEN t.partneruserid ELSE t.userid END = i.z AS found
        , ARRAY[CASE i.a WHEN t.userid THEN t.partneruserid ELSE t.userid END, i.a] AS path
   FROM   tbl t, ids i
   WHERE  i.a IN (userid, partneruserid)   -- constraint user_id_smaller simplifies this task

   UNION ALL
   SELECT t.pk
        , CASE c.path[1] WHEN t.userid THEN t.partneruserid ELSE t.userid END = i.z AS found
        , CASE c.path[1] WHEN t.userid THEN t.partneruserid ELSE t.userid END || c.path
   FROM   cte c, tbl t, ids i
   WHERE  c.path[1] IN (t.userid, t.partneruserid)
   AND    CASE c.path[1] WHEN t.userid THEN t.partneruserid ELSE t.userid END <> ALL (c.path)
   AND    NOT c.found
   AND    array_length(path,1) < 4
   )
SELECT path, array_length(path,1) - 1 AS steps
FROM   cte c
WHERE  found;

SQL Fiddle

Except for the count. It's unclear how you count duplicate connections exactly. Strictly speaking, you should multiply ct with every step ...

Explanation

The adapted table forces userid to be the lower ID, ~~so the initial predicate can be simplified: userid matches lower ID, partneruserid matches higher ID. If your input is parameterized, you can sort with LEAST (x,y) and GREATEST(x,y)~~ Update: That shortcut only works for short-circuiting in a single step. We need to consider both IDs as starting point for longer paths. Query fixed.

I pick the lower input value (a) as start (could be the other way round) and walk the graph of partnerships until I reach the higher input value (z). Each step can connect to userid or partneruserid, which necessitates the CASE expressions: the match could be on either column, we need to continue with the other column respectively.

We have to remember the path so far (path) and rule out repetitions so not to walk in circles (... <> ALL (c.path)).

I save the last node of the path as first element (path[1]), just to avoid another column for that information. Either way may be more convenient / faster ...

Stop traversing the graph when z is reached (found) or when a given maximum length for the path has been exhausted (array_length(path,1) < 4).

Only rows where z is reached (WHERE found) are in the final result set. I only collect the bare minimum of information along the way. You can add more columns to the query or / and JOIN to the original table in the outer query to return more fields.

Related Solutions

Sql-server – Does WITH SCHEMABINDING on a multi-statement TVF improve cardinality estimates

In my tests, no, adding WITH SCHEMABINDING does not improve cardinality estimates. I created a simple table:

CREATE TABLE dbo.myobjects(id INT PRIMARY KEY);

INSERT dbo.myobjects SELECT [object_id] FROM sys.all_objects;

Then two functions:

CREATE FUNCTION dbo.noschemabinding(@UserID INT)
RETURNS @x TABLE (id INT)
AS
BEGIN
  INSERT @x SELECT id FROM dbo.myobjects;

  RETURN;
END
GO

CREATE FUNCTION dbo.withschemabinding(@UserID INT)
RETURNS @x TABLE (id INT)
WITH SCHEMABINDING
AS
BEGIN
  INSERT @x SELECT id FROM dbo.myobjects;

  RETURN;
END
GO

Comparing the actual plans, both show estimated rows = 1, actual rows = 2112 (this latter number may differ on your system depending on version/SP etc).

Comparing the speed:

SET NOCOUNT ON;
GO
SELECT SYSDATETIME();
GO
SELECT id INTO #x FROM dbo.noschemabinding(1);
DROP TABLE #x;
GO 1000
GO
SELECT SYSDATETIME();
GO
SELECT id INTO #x FROM dbo.withschemabinding(1);
DROP TABLE #x;
GO 1000
SELECT SYSDATETIME();

Results:

                    run 1               run 2
----------------    ------------------  ------------------
No schemabinding    14632 milliseconds  14079 milliseconds
Schemabinding       14251 milliseconds  13979 milliseconds

So, does it matter much? Nope.

SCHEMABINDING in this case is used for a more important goal: underlying schema stability. You will probably have much better optimization opportunities if you pursue converting your function to an inline TVF than to chase down obscure plan-affecting differences in a multi-statement TVF.

MySQL Recursive Query – Using Stored Procedure and CURSOR

So the three questions I'm hoping someone can answer are:

1.Should I be using a CURSOR here or is there an alternative that will get the same recursive result?

No, you should not. The CURSOR keep being opened over and over. Thinking of the overhead make me cringe. Personally, I stay away from CURSORs.

2.How can I get the results back in a single result set so that it can be used in the same fashion as a subselect?

3.What is the proper way of using the results from the CALL because I can't at least as far as I've tried get the sample SELECT statement above to work? I believe this is because I can't use CALL inline but I'm not sure.

I would like to bail you out of this by suggesting something I learned from my college days when learning about data structures. You need to perform tree traversal using a queue. This allows you to start at a root and express all descendants of a tree.

The algorithm goes like this

STEP 01 : Start with an empty queue
STEP 02 : Dequeue Node From the Front of the Queue
STEP 03 : Enqueue All Children of the Latest Node
STEP 04 : Process Info From the Latest Node
STEP 05 : If the Queue is Not Empty, Go Back to STEP 02
STEP 06 : All Done

This allows you to traverse a recursive structure without using Programmatic Recursion. At this point, you are probably asking: How can I traverse a tree structure without recursion?

I wrote a post about how to script three Stored Procedures that can using a loop in a single CALL that will traverse a table with nodes and its parent in a table:

GetParentIDByID
GetAncestry
GetFamilyTree

The post is Find highest level of a hierarchical field: with vs without CTEs (Oct 24, 2011). It contains the Stored Procedures already written that will traverse the following table structure:

+------------+--------------+------+-----+---------+----------------+
| Field      | Type         | Null | Key | Default | Extra          |
+------------+--------------+------+-----+---------+----------------+
| id         | int(11)      | NO   | PRI | NULL    | auto_increment | 
| parent_id  | int(11)      | YES  |     | NULL    |                | 
| name       | varchar(255) | YES  |     | NULL    |                | 
| notes      | text         | YES  |     | NULL    |                | 
+------------+--------------+------+-----+---------+----------------+

Please read the code carefully and apply the principles.

Give it a Try !!!