Postgresql – PL/pgSQL function or rCTE to detect depth of relations between two tables

cteplpgsqlpostgresqlrecursive

I need to perform a PostgreSQL Function for checking the depth of relations. There is a circular binding from table a to b, back to another element in a and, in some cases back again to b. These relation is possible several times. I try to build a function that request these relations.

I tried to build a simply SQL query but it breaks with memory overflow error, building a result table with more then 100 million rows. When checking the full table with more then 10 million rows. So I try a FOR loop which refuse storage after each loop.

Problem is now the loop runs to the first element which counts to relation 1 but not further. All further relations got the relation count 0, and bstid remains at a single value.

I put some RETURN NEXT statements to get debugging controls. Is there something wrong with the nested IF-ELSE statements?

CREATE OR REPLACE FUNCTION bst_ebenen() RETURNS SETOF varchar(1000) AS 
$BODY$ 
DECLARE 
    --an_array varchar[bstobjid][id];
    bstobjid varchar;
    loopid bigint;
    bstid bigint;
    relation varchar;
    searchsql text := '';
    ebenen integer := 0;
    message varchar := '';

BEGIN
    searchsql = 'SELECT objid AS bstobjid FROM buchstelle';

    FOR bstobjid IN EXECUTE(searchsql) LOOP

        ebenen = 0;

        --1. Relationsebene abfragen
        loopid = (SELECT bst.id FROM buchstelle bst WHERE bst.objid = bstobjid);

        IF loopid NOT IN (SELECT rid FROM buchstelle__relation LIMIT 1) 
            THEN ebenen = 0;                    
        ELSE 
        ebenen = 1;

            relation = (SELECT rel.relation FROM buchstelle__relation rel
                        WHERE rel.rid = loopid LIMIT 1);

            RETURN NEXT relation;

            --2. Relationsebene abfragen

            bstid = (SELECT bst.id FROM buchstelle bst WHERE bst.objid = relation LIMIT 1);
            RETURN NEXT bstid;

            IF bstid NOT IN (SELECT rid FROM buchstelle__relation LIMIT 1)
            THEN ebenen = ebenen;
            ELSE 
                ebenen = 2; 
                relation = (SELECT rel.relation FROM buchstelle__relation rel
                            WHERE rel.rid = bstid LIMIT 1);

                message = (ebenen, 'Ebenen');
                RETURN NEXT message;
                RETURN NEXT relation;

                --3. Relationsebene abfragen

                bstid = (SELECT bst.id FROM buchstelle bst WHERE bst.objid = relation LIMIT 1);
                RETURN NEXT bstid;

                IF bstid NOT IN (SELECT rid FROM buchstelle__relation LIMIT 1)
                    THEN ebenen = ebenen;

                ELSE 
                    ebenen = 3; 
                    relation = (SELECT rel.relation FROM buchstelle__relation rel
                                WHERE rel.rid = bstid LIMIT 1);


                    --4. Relationsebene abfragen

                    bstid = (SELECT bst.id FROM buchstelle bst WHERE bst.objid = relation LIMIT 1);
                    RETURN NEXT bstid;

                    IF bstid NOT IN (SELECT rid FROM buchstelle__relation LIMIT 1)
                        THEN ebenen = ebenen;

                    ELSE 
                        ebenen = 4; 
                        relation = (SELECT rel.relation FROM buchstelle__relation rel
                                    WHERE rel.rid = bstid LIMIT 1);


                        --5. Relationsebene abfragen

                        bstid = (SELECT bst.id FROM buchstelle bst WHERE bst.objid = relation LIMIT 1);

                        IF bstid NOT IN (SELECT rid FROM buchstelle__relation LIMIT 1)
                            THEN ebenen = ebenen;

                        ELSE 
                            ebenen = 5; 
                            relation = (SELECT rel.relation FROM buchstelle__relation rel
                                        WHERE rel.rid = bstid LIMIT 1);

                        END IF;
                    END IF;
                END IF;     
            END IF;
        END IF;

        -- Ausgabestring für das jeweilige Objekt

        message = (bstobjid, loopid, ' mit ', ebenen, ' Relationsebenen');

        RETURN NEXT message;

    END LOOP;

-- Ausgabestring für die Gesamtprüfung  

message = (max(ebenen), ' ist die maximale Relationstiefe');

RETURN NEXT message;

END;
$BODY$
 LANGUAGE plpgsql;

[EDIT]

@horse_with_no_name:

Yes, before I walk the function way, I tried this one:

WITH RECURSIVE ebenen (objid_bst, rid_anrel, verweistauf_bst)
AS (

SELECT bst.objid AS objid_bst, anrel.rid As rid_anrel, anrel.relation AS verweistauf_bst
FROM buchstelle bst
LEFT JOIN buchstelle__relation anrel ON bst.id = anrel.rid 

UNION ALL

SELECT bst.objid, anrel.rid, anrel.relation
FROM ebenen, buchstelle bst
INNER JOIN buchstelle__relation anrel ON anrel.relation = bst.objid
WHERE bst.id IN (anrel.rid)
)

SELECT 
objid_bst, rid_anrel, verweistauf_bst
, count(rid_anrel) OVER (PARTITION BY objid_bst) AS relationen
 FROM ebenen
 ORDER BY relationen
--LIMIT 1000000;

And at the beginning a nested SQL-Query with several subqueries. First two steps came with memory overflow by running the queries on the full dataset.

[EDIT 30.7.2014]

I looked for cyclic directed graph relations to find a solution for my task. But they deal with realtions within a single table. I need to handle a cylce between two tables. For better understanding here a simplified relationship as it may occur. Each jump from tab2 back to tab1 in this path marks one relation and should be counted. Maybe I'll be able to solve the task by putting a CTE in a functions FOR-Loop? But I need more understanding about how CTEs really work.

tab1
id          objid 
1            aaa
2            bbb
3            ccc
4            ddd  
5            eee 
6            fff           
7            ggg
8            hhh

tab2
id       rid         rel2tab1         
1         3            aaa  
2         4            aaa
3         7            hhh
4         8            ccc

relations for each element in tab1
1
2
3 --> 1(tab2) --> 1(tab1)
4 --> 2(tag2) --> 1(tab1)
5 
6
7 --> 3(tab2) --> 8(tab1) --> 4(tab2) --> 3(tab1) --> 1(tab2) --> 1(tab1)
8 --> 4(tab2) --> 3(tab1) --> 1(tab2) --> 1(tab1)

[EDIT: 19.08.2014]

After some days, I took the task again and discovered a much easier solution which comes with high-performance. Maybe someone else can use it for similar tasks:

SELECT CASE WHEN
(SELECT sum(t1.id)
FROM
    table1 t1
    LEFT JOIN table2 t2 ON t2.rid = t1.id
WHERE t2.rid IS NOT NULL) IS NULL THEN '0' 
WHEN
(SELECT sum(t1.id)
FROM
    table1 t1
    LEFT JOIN table2 t2 ON t2.rid = t1.id
WHERE t1.objid IN 
(
    SELECT t2.rel_t1
    FROM
    table1 t1
    LEFT JOIN table2 t2 ON t2.rid = t1.id
    WHERE t2.rid IS NOT NULL
)
AND t2.rel_t1 IS NOT NULL) IS NULL THEN '1'
WHEN
(SELECT  sum(t1.id)
FROM
table1 t1
LEFT JOIN table2 t2 ON t2.rid = t1.id
WHERE t1.objid IN (
    SELECT t2.rel_t1
    FROM
        table1 t1
        LEFT JOIN table2 t2 ON t2.rid = t1.id
    WHERE t1.objid IN 
    (
        SELECT t2.rel_t1
        FROM
        table1 t1
        LEFT JOIN table2 t2 ON t2.rid = t1.id
        WHERE t2.rid IS NOT NULL
    )
    AND t2.rel_t1 IS NOT NULL
)
AND t2.rel_t1 IS NOT NULL) IS NULL THEN '2'
WHEN
(SELECT sum(t1.id)
FROM
    table1 t1
    LEFT JOIN table2 t2 ON t2.rid = t1.id
WHERE t1.objid IN 
(
    SELECT t2.rel_t1
    FROM
        table1 t1
        LEFT JOIN table2 t2 ON t2.rid = t1.id
    WHERE t1.objid IN 
    (
        SELECT t2.rel_t1
        FROM
            table1 t1
            LEFT JOIN table2 t2 ON t2.rid = t1.id
        WHERE t1.objid IN 
        (
            SELECT t2.rel_t1
            FROM
                table1 t1
                LEFT JOIN table2 t2 ON t2.rid = t1.id
            WHERE t2.rid IS NOT NULL
        )
        AND t2.rel_t1 IS NOT NULL
    )
    AND t2.rel_t1 IS NOT NULL
)
AND t2.rel_t1 IS NOT NULL) IS NULL THEN '3'
WHEN
(SELECT sum(t1.id)
FROM
    table1 t1
    LEFT JOIN table2 t2 ON t2.rid = t1.id
WHERE t1.objid IN 
(
    SELECT t2.rel_t1
    FROM
        table1 t1
        LEFT JOIN table2 t2 ON t2.rid = t1.id
    WHERE t1.objid IN 
    (
        SELECT t2.rel_t1
        FROM
            table1 t1
            LEFT JOIN table2 t2 ON t2.rid = t1.id
        WHERE t1.objid IN 
        (
            SELECT t2.rel_t1
            FROM
                table1 t1
                LEFT JOIN table2 t2 ON t2.rid = t1.id
            WHERE t1.objid IN 
            (
                SELECT t2.rel_t1
                FROM
                    table1 t1
                    LEFT JOIN table2 t2 ON t2.rid = t1.id
                WHERE t2.rid IS NOT NULL
            )
            AND t2.rel_t1 IS NOT NULL
        )
        AND t2.rel_t1 IS NOT NULL
    )
    AND t2.rel_t1 IS NOT NULL
)
AND t2.rel_t1 IS NOT NULL) IS NULL THEN '4' 
WHEN
(SELECT sum(t1.id)
FROM
    table1 t1
    LEFT JOIN table2 t2 ON t2.rid = t1.id
WHERE t1.objid IN 
(
    SELECT t2.rel_t1
    FROM
        table1 t1
        LEFT JOIN table2 t2 ON t2.rid = t1.id
    WHERE t1.objid IN 
    (
        SELECT t2.rel_t1
        FROM
            table1 t1
            LEFT JOIN table2 t2 ON t2.rid = t1.id
        WHERE t1.objid IN 
        (
            SELECT t2.rel_t1
            FROM
                table1 t1
                LEFT JOIN table2 t2 ON t2.rid = t1.id
            WHERE t1.objid IN 
            (
                SELECT t2.rel_t1
                FROM
                    table1 t1
                    LEFT JOIN table2 t2 ON t2.rid = t1.id
                WHERE t1.objid IN 
                (
                    SELECT t2.rel_t1
                    FROM
                        table1 t1
                        LEFT JOIN table2 t2 ON t2.rid = t1.id
                    WHERE t2.rid IS NOT NULL
                )
                AND t2.rel_t1 IS NOT NULL
            )
            AND t2.rel_t1 IS NOT NULL
        )
        AND t2.rel_t1 IS NOT NULL
    )
    AND t2.rel_t1 IS NOT NULL
)
AND t2.rel_t1 IS NOT NULL) IS NULL THEN '5'
ELSE 'Error: reached the maximum testing depth'
END AS relations

Best Answer

A recursive CTE seems the way to go.
Assuming your path has no cycles. Else it needs more work to detect cycles. The array solution below can readily be adapted.

Test setup

Building on this simplified layout:

CREATE TABLE t1 (t1_id int, objid text);
INSERT INTO t1 VALUES
 (1,'aaa')
,(2,'bbb')
,(3,'ccc')
,(4,'ddd')
,(5,'eee')
,(6,'fff')
,(7,'ggg')
,(8,'hhh');

CREATE TABLE t2 (t2_id int, t1_id int, objid text);
INSERT INTO t2 VALUES
 (1,3,'aaa')
,(2,4,'aaa')
,(3,7,'hhh')
,(4,8,'ccc');

Two different solutions:

Solution with string as path

WITH RECURSIVE cte AS (
   SELECT t.t1_id AS start_id
        , t2.t2_id::text || '(t2)' || COALESCE(' ->' || t1.t1_id || '(t1)', '') AS path
        , t1.t1_id
   FROM   t1 t
   LEFT   JOIN t2 USING (t1_id)
   LEFT   JOIN t1 ON t1.objid = t2.objid

   UNION ALL
   SELECT c.start_id
        , c.path || ' ->' || t2.t2_id || '(t2)' || COALESCE(' ->' || t1.t1_id || '(t1)', '')
        , t1.t1_id
   FROM   cte c
   JOIN   t2      USING (t1_id)
   LEFT   JOIN t1 USING (objid)
   )
SELECT DISTINCT ON (start_id)
       start_id, path
FROM   cte
ORDER  BY start_id, path DESC;

Result:

start_id   path
1          
2          
3          1(t2) ->1(t1)
4          2(t2) ->1(t1)
5          
6          
7          3(t2) ->8(t1) ->4(t2) ->3(t1) ->1(t2) ->1(t1)
8          4(t2) ->3(t1) ->1(t2) ->1(t1)

Table names are redundant, obviously. I added them for good looks.

Solution with inverted array

Rightmost element is first t2_id, keep alternating from right to left.

WITH RECURSIVE cte AS (
   SELECT t.t1_id AS start_id, ARRAY[t1.t1_id, t2.t2_id] AS path
   FROM   t1 t
   LEFT   JOIN t2 USING (t1_id)
   LEFT   JOIN t1 ON t1.objid = t2.objid

   UNION ALL
   SELECT c.start_id, t1.t1_id || (t2.t2_id || path)
   FROM   cte c
   JOIN   t2 ON t2.t1_id = path[1]
   LEFT   JOIN t1 USING (objid)
   )
SELECT DISTINCT ON (start_id)
       start_id, array_remove(path, NULL) AS path
FROM   cte
ORDER  BY start_id, array_length(path, 1) DESC;

Result:

start_id   path
1          {}
2          {}
3          {1,1}
4          {1,2}
5          {}
6          {}
7          {1,1,3,4,8,3}
8          {1,1,3,4}

array_remove() requires Postgres 9.3+.

Inverted the array to need one less columns. By putting the last element first, I can reference path[1] in the next step. Not sure if that's cheaper, would need a test ...

Shorter code, but probably slower. I suspect array handling is more expensive. Easier to adapt if we need to observe cycles.

SQL Fiddle.

Major points

  • We are alternating between two tables.
    To make this recursive, one step needs to cover two hops (from t1 -> t2 and back t2 -> t1).

  • The initial SELECT uses 2x LEFT JOIN to include all rows like in your example result.
    The recursive part JOIN to stop the loop when no match is found. The hop back uses LEFT JOIN again.