PostgreSQL – Create a grouping based on chains of pairs

Say I have two tables:

CREATE TABLE datum
(
  datum_id SERIAL PRIMARY KEY,
  datum_text TEXT NOT NULL
);

CREATE TABLE datum_pair
(
  datum_id1 INTEGER NOT NULL REFERENCES datum (datum_id),
  datum_id2 INTEGER NOT NULL REFERENCES datum (datum_id)
);

datum_pair records related pairs of rows in datum. What I want to do is identify groups of rows that are all related, including relationships that pass through another row. For example, if the IDs 1 and 3 are related, the IDs 3 and 5 are related, and the IDs 5 and 7 are related, then 1, 3, 5, and 7 would all be in the same group. The number of rows participating in one of these chains is arbitrary.

I will need to perform various aggregations grouped by each chain. I'm not set on how the output should look, but one idea I have is to associate each ID of a chain with a single ID. For example, if 1, 3, 5, and 7 are all in the same chain, then this output would work:

 datum_id | min_id_in_chain
----------+-----------------
        1 |               1
        3 |               1
        5 |               1
        7 |               1

Solutions that offer a different output format but still allow for aggregation by chain are welcome.

There are two additional complications:

datum_pair always contains relationships in both directions. So if there is a row for IDs 1 and 3, there is also a row for 3 and 1. (In my use case, this is actually the result of a self JOIN, but I don't think this is especially relevant to the question.) Each datum should only appear once in the output.
Not all rows in datum participate in a chain; many are solitary. But the final result still needs to include them. I would expect them to be identified as a chain containing only that row, so that aggregation would not combine these together.

I've created a sample SQL Fiddle containing 100 rows. To simplify testing the output, this sample relates all the odd-numbered IDs to one another and all the even-numbered IDs to one another, except for 1 and 99, which are not related to any of the other rows.
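
Comparable test data can be generated directly in PostgreSQL. The following is only a sketch of one way to do it, not the contents of the fiddle. It assumes the two tables were just created and are still empty (so the SERIAL column hands out IDs 1 to 100), it assumes that each ID is linked to the ID two above it, and the 'row N' text and the name fwd are made up for the example.

```sql
-- Assumption: datum and datum_pair are freshly created and empty,
-- so datum_id values 1..100 are assigned by the SERIAL default.
INSERT INTO datum (datum_text)
SELECT 'row ' || g::text
FROM   generate_series(1, 100) AS g;

-- Assumed linking rule: pair each ID with ID + 2, leaving 1 and 99 out.
-- This builds the chains 3-5-...-97 and 2-4-...-100.
WITH fwd AS (
    SELECT g AS id1, g + 2 AS id2
    FROM   generate_series(2, 98) AS g
    WHERE  g <> 97                  -- 97 + 2 = 99, which must stay solitary
)
INSERT INTO datum_pair (datum_id1, datum_id2)
SELECT id1, id2 FROM fwd
UNION ALL
SELECT id2, id1 FROM fwd;           -- store every pair in both directions
```

With this data a correct query should report two multi-row chains plus the two single-row chains for IDs 1 and 99.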

Using PostgreSQL 9.3. (I would appreciate mentioning any simplifications provided by features of 9.4.)

Best Answer

That's a complicated puzzle. I've found a solution that starts with a list of all identifiers that appear in a pair. For each identifier, it builds an array (called "members") that initially contains only that identifier, then repeatedly expands the array with any pair that has exactly one side already in it, until no such pair remains.

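-- "pairs" (id1, id2) is this answer's test table; it corresponds to
-- datum_pair (datum_id1, datum_id2) in the question.
with    recursive list as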
        (
        select  id1 as id
        from    pairs
        union
        select  id2
        from    pairs
        )
,       cte as
        (
        select  id
        ,       ARRAY[id] as members
        from    list
        union all
        select  cte.id
        ,       members || ARRAY[p.id1, p.id2]
        from    cte
        join    pairs p
        on      (
                    cte.members @> ARRAY[p.id1] 
                    and not cte.members @> ARRAY[p.id2]
                )
                or
                (
                    not cte.members @> ARRAY[p.id1]
                    and cte.members @> ARRAY[p.id2]
                )
        )
select  id
,       min(v) as min_id
from    cte
cross join lateral
        unnest(cte.members) v
group by
        id
order by
        id
;

Example at SQL Fiddle. Hopefully someone else can contribute a simpler, more elegant, or better performing solution!
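
The requirement to keep solitary rows can be met by seeding the recursion from datum instead of from the pair table. The following is a sketch of a different formulation, not the answer's query. It is written against the datum and datum_pair tables from the question, relies on every pair being stored in both directions as the question states, and uses UNION (rather than UNION ALL) so that rows already produced are discarded and the recursion stops. The name reach is made up for the example.

```sql
-- Assumes the question's tables and that datum_pair holds both (a, b) and (b, a).
WITH RECURSIVE reach (datum_id, member_id) AS (
    -- Every row starts as a member of its own chain.
    SELECT datum_id, datum_id
    FROM   datum
    UNION
    -- Follow one more pair from any member found so far.
    SELECT r.datum_id, p.datum_id2
    FROM   reach r
    JOIN   datum_pair p ON p.datum_id1 = r.member_id
)
SELECT   datum_id,
         min(member_id) AS min_id_in_chain
FROM     reach
GROUP BY datum_id
ORDER BY datum_id;
```

An unpaired row never joins to datum_pair, so it simply reports itself as min_id_in_chain. The intermediate result holds one row per (row, reachable row) combination, so it grows quadratically with chain length and may be slow on long chains.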

Related Solutions

PostgreSQL – Efficiently select beginning and end of multiple contiguous ranges in Postgresql query

How about using WITH RECURSIVE?

test view:

create view v as 
select *
from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) as baz ("name", "int");

query:

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                     union all
                                     select "name", v."int", t.span+1 as span
                                     from v join t using ("name")
                                     where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
      from ( select "name", "int", max(span) as span 
             from t
             group by "name", "int" ) z
      group by "name", ("int"-span+1) ) z;

result:

 name | start | end | span
------+-------+-----+------
 foo  |     2 |   4 |    3
 foo  |    13 |  13 |    1
 bar  |     1 |   3 |    3
 foo  |    10 |  11 |    2
(4 rows)

I'd be interested to know how that performs on your billion row table.
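
For comparison, the same ranges can also be found without recursion, using a window function. This is a sketch of a different technique (often called "gaps and islands"), not the query above. It assumes that v contains no duplicate ("name", "int") rows, and the names numbered and grp are made up for the example.

```sql
-- Within one "name", consecutive values keep a constant difference
-- between "int" and their row number, so that difference labels each range.
SELECT   "name",
         min("int") AS "start",
         max("int") AS "end",
         count(*)   AS span
FROM (
    SELECT "name",
           "int",
           "int" - row_number() OVER (PARTITION BY "name" ORDER BY "int") AS grp
    FROM   v
) numbered
GROUP BY "name", grp
ORDER BY "name", "start";
```

For the test view above this should return the same four ranges, sorted by name and start.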

PostgreSQL – Best way of finding rows referencing a given id on PostgreSQL

I suggest your first option, with two improvements and some simplifications.

(
SELECT 1      -- irrelevant what you select here
FROM   client_category_price
WHERE  sellable_id = '9bc202ca-f7c1-11e2-a751-062b1fc90460'
LIMIT  1      -- may be redundant
)
UNION ALL     -- not just UNION

  ...

UNION ALL
(
SELECT 1
FROM   work_order_item
WHERE  sellable_id = '9bc202ca-f7c1-11e2-a751-062b1fc90460'
LIMIT  1
)
LIMIT  1;      -- this one is crucial

Given that all you want to know is

if any of those (table, column) have my row's id there, which would prevent its deletion.

You don't need a full list of violating rows. Stop searching at the first one. All you need to do is add another LIMIT 1 at the end of the query. This way, Postgres skips the rest of the query as soon as the first row is found. You probably don't need LIMIT 1 for each SELECT, just the one at the end. Test without it, since it may produce a different query plan.
Use UNION ALL instead of UNION. It is faster because it skips the duplicate-elimination step.
Some other simplifications.
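
If a single true/false answer is more convenient than "one row or no row", the same early-exit idea can be wrapped in EXISTS. This is only a sketch of a variation, not part of the suggestion above. It uses just the two tables named in the query, and the column alias is_referenced is made up for the example.

```sql
-- EXISTS only needs to know whether the subquery returns at least one row.
SELECT EXISTS (
    SELECT 1
    FROM   client_category_price
    WHERE  sellable_id = '9bc202ca-f7c1-11e2-a751-062b1fc90460'
    UNION ALL
    -- further UNION ALL branches for the remaining referencing tables go here
    SELECT 1
    FROM   work_order_item
    WHERE  sellable_id = '9bc202ca-f7c1-11e2-a751-062b1fc90460'
) AS is_referenced;
```

As with the LIMIT 1 variant, it is worth comparing the query plans of both forms on real data.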
