Postgresql – Group by array overlapping

postgresql

I have a PostgreSQL table with id and clusters like this:

CREATE TABLE w (id bigint, clst int);
INSERT INTO w (id,clst)
VALUES 
  (1,0),
  (1,4),
  (2,1),
  (2,2),
  (2,3),
  (3,2),
  (4,2),
  (5,4),
  (6,5);

If you aggregate clusters grouped by id, you can see that there are overlapping values in the cluster arrays:

select id, array_agg(clst) clst from w group by id order by id;
 id |  clst
----+---------
  1 | {0,4}
  2 | {1,2,3}
  3 | {2}
  4 | {2}
  5 | {4}
  6 | {5}

i.e. cluster 4 covers id 1 and 5, cluster 2 covers id 2, 3 and 4, whereas cluster 5 corresponds only to one id.

How can I now aggregate ids grouped by cluster arrays overlapping?
i.e. the expected result is:

 id      | clst
---------+-------
 {1,5}   | {0,4,4}
 {2,3,4} | {1,2,3,2,2}
 {6}     | {5}

I don't care much about the cluster column just need ids properly aggregated.

There is no restriction on number of possible overlappings. Number of clusters per id is not restricted either (it can be hundreds or even more). Clusters are assined to ids not sequenctially.

There are millions of rows in the table!!!

Using PostgreSQL 11.

Best Answer

I don't care much about the cluster column just need ids properly aggregated.

In that case we can make use of the uniq and sort functions in the intarray extension:

with recursive a as (
  select id, array_agg(distinct clst) clst from w group by id)
, t(id,pid,clst) as (
  select id,id,clst from a
  union all
  select t.id,a.id,t.clst|a.clst
  from t join a on a.id<>t.pid and t.clst&&a.clst and not t.clst@>a.clst)
, d as (
  select distinct on(id) id, clst from t order by id, cardinality(clst) desc)
select array_agg(id), clst from d group by clst;
array_agg | clst   
:-------- | :------
{6}       | {5}    
{2,3,4}   | {1,2,3}
{1,5}     | {0,4}  

db<>fiddle here

Bear in mind that this is unlikely to perform well on millions of rows.