Postgresql – Create a grouping based on chains of pairs

postgresql

Say I have two tables:

CREATE TABLE datum
(
  datum_id SERIAL PRIMARY KEY,
  datum_text TEXT NOT NULL
);

CREATE TABLE datum_pair
(
  datum_id1 INTEGER NOT NULL REFERENCES datum (datum_id),
  datum_id2 INTEGER NOT NULL REFERENCES datum (datum_id)
);

datum_pair records related pairs of rows in datum. What I want to do is form identify groups that are all related, including relationships that pass through another row. For example, if the IDs 1 and 3 are related, the IDs 3 and 5 are related, and the IDs 5 and 7 are related, then 1, 3, 5, and 7 would all be in the same group. The number of rows participating in one of these chains is arbitrary.

I will need to perform various aggregations grouped by each chain. I'm not set on how the output should look, but one idea I have is to associate each ID of a chain with a single ID. For example, if 1, 3, 5, and 7 are all in the same chain, then this output would work:

 datum_id | min_id_in_chain
----------+-----------------
     1    |      1
     3    |      1
     5    |      1
     7    |      1

Solutions that offer a different output format but still allow for aggregation by chain are welcome.

There are two additional complications:

  1. datum_pair always contains relationships in both directions. So if there is a row for IDs 1 and 3, there is also a row for 3 and 1. (In my use case, this is actually the result of a self JOIN, but I don't think this is especially relevant to the question.) Each datum should only appear once in the output.
  2. Not all rows in datum participate in a chain; many are solitary. But the final result still needs to include them. I would expect them to be identified as a chain containing only that row, so that aggregation would not combine these together.

I've created a sample SQL Fiddle containing 100 rows. To simplify testing the output, this sample relates all the odd numbered IDs and all the even numbered IDs except for 1 and 99. 1 and 99 are not related to any of the other rows.

Using PostgreSQL 9.3. (I would appreciate mentioning any simplifications provided by features of 9.4.)

Best Answer

That's a complicated puzzle. I've found a solution that starts with a list of all identifiers. Then for each identifier, it creates an array of identifiers that it is paired with. It iteratively expands that array (called "members") until all pairs are included.

; with  recursive list as
        (
        select  id1 as id
        from    pairs
        union
        select  id2
        from    pairs
        )
,       cte as
        (
        select  id
        ,       ARRAY[id] as members
        from    list
        union all
        select  cte.id
        ,       members || ARRAY[p.id1, p.id2]
        from    cte
        join    pairs p
        on      (
                    cte.members @> ARRAY[p.id1] 
                    and not cte.members @> ARRAY[p.id2]
                )
                or
                (
                    not cte.members @> ARRAY[p.id1]
                    and cte.members @> ARRAY[p.id2]
                )
        )
select  id
,       min(v) as min_id
from    cte
cross join lateral
        unnest(cte.members) v
group by
        id
order by
        id
;

Example at SQL Fiddle. Hopefully someone else can contribute a simpler, more elegant, or better performing solution!