Postgresql – Grouping rows on field A or field B

postgresql

I have data structured as follows:

create table T 
( field_1 char(1) not null
, field_2 int not null
,    primary key(field_1, field_2)
);

insert into T (field_1, field_2)
values ('A',1)
      ,('A',2)
      ,('B',3)
      ,('C',3)
      ,('D',4)   
      ,('E',4)
      ,('E',5)
      ,('F',6)
      ,('G',1);

I would like to obtain sets that have either field_1 or field_2 in common, like this:

field_1     | field_2   | set
-------------------------------
A           | 1         | one
A           | 2         | one
G           | 1         | one
-------------------------------
B           | 3         | two
C           | 3         | two
-------------------------------
D           | 4         | three
E           | 4         | three
E           | 5         | three
-------------------------------
F           | 6         | four

Basically, rows having common values in field_1 OR field_2 should be assigned to the same ID. I feel like this is a very basic operation but for some reason I cannot wrap my head around the proper way to do it.

Any clues?

Best Answer

Here's one attempt that I'm unsatisfied with for several reasons, but it's a start. The idea is to calculate the transitive closure of the graph and then use the minimum field_2 for each field_1 as the group. The latter uses the fact that for two nodes either share the same closure, or the closures are disjoint. As an example:

{(a,1),(a,2),(b,2),(b,3)}

is an impossible transitive closure since if a and b share 2, they must also share 1 and 3:

I used Db2 (which does not support the keyword recursive in CTE as well as ansi join in the CTE)

CREATE TABLE t (x char(1) not null, y int not null, primary key (x,y));
INSERT INTO t (x,y)
VALUES ('A',1) 
      ,('A',2)
      ,('B',3)
      ,('C',3)
      ,('D',4)   
      ,('E',4)
      ,('E',5)
      ,('F',6)
      ,('G',1);

with tc (field_1,field_2,n) as (
    select field_1,field_2,0 from t
    union all
    select t1.field_1, t3.field_2,n+1
    from tc as t1, t as t2, t as t3
    where t1.field_2 = t2.field_2
      and t1.field_1 < t2.field_1
      and t2.field_1 = t3.field_1
      and t2.field_2 < t3.field_2
      and t1.field_2 < t2.field_2
      and n <= (select count(1) from t)
), min_closure(field_1,field_2) as (
    -- remove duplicates
    select distinct field_1,field_2 from tc
)
select t.field_1, t.field_2
     , ( select min(field_2)
         from min_closure mc1
         where t.field_1 = mc1.field_1 ) as grp
from t
order by grp;

A           1           1
G           1           1
A           2           1
B           3           3
C           3           3
D           4           4
E           4           4
E           5           4
F           6           6

The group can be further adjusted, but I'll leave it like that.

Major problem is that the CTE only sees the previous iteration, so it is very likely that we will add a lot of redundant edges to the tc. For a large graph that won't be very pleasant.

Edit: attempt to minimize impact by only looking in one direction in graph

Example of a longer chain:

delete from t;
insert into t (x,y) values ('F',5),('F',6),('G',6),('G',7),('H',7),('H',8);
with tc (x,y,n) as (select x,y,0 ...)
select distinct t.x, t.y, min(mc1.y) over (partition by mc1.x) as grp 
from min_closure mc1 join t 
    on t.x = mc1.x order by 3;

F           5           5
F           6           5
G           6           5
G           7           5
H           7           5
H           8           5

EDIT: adjusted for PostgreSQL dialect:

with recursive tc (field_1,field_2,n) as (
    select field_1,field_2,0 from t
    union all 
    select t1.field_1, t3.field_2,n+1
    from tc as t1
    join t as t2
        on t1.field_2 = t2.field_2
     and t1.field_1 < t2.field_1 
    join t as t3
        on t2.field_1 = t3.field_1
       and t2.field_2 < t3.field_2 
    where t1.field_2 < t2.field_2 
      and n <= (select count(1) from t)
), min_closure(field_1,field_2) as (
    -- remove duplicates
    select distinct field_1,field_2 from tc
) 
select t.field_1, t.field_2
     , ( select min(field_2) 
         from min_closure mc1
         where t.field_1 = mc1.field_1 ) as grp
from t 
order by 1;