I have a PostgreSQL table with id and clusters like this:
CREATE TABLE w (id bigint, clst int);
INSERT INTO w (id,clst)
VALUES
(1,0),
(1,4),
(2,1),
(2,2),
(2,3),
(3,2),
(4,2),
(5,4),
(6,5);
If you aggregate clusters grouped by id, you can see that there are overlapping values in the cluster arrays:
select id, array_agg(clst) clst from w group by id order by id;
id | clst
----+---------
1 | {0,4}
2 | {1,2,3}
3 | {2}
4 | {2}
5 | {4}
6 | {5}
i.e. cluster 4 covers id 1 and 5, cluster 2 covers id 2, 3 and 4, whereas cluster 5 corresponds only to one id.
How can I now aggregate ids grouped by cluster arrays overlapping?
i.e. the expected result is:
id | clst
---------+-------
{1,5} | {0,4,4}
{2,3,4} | {1,2,3,2,2}
{6} | {5}
I don't care much about the cluster column just need ids properly aggregated.
There is no restriction on number of possible overlappings. Number of clusters per id is not restricted either (it can be hundreds or even more). Clusters are assined to ids not sequenctially.
There are millions of rows in the table!!!
Using PostgreSQL 11.
Best Answer
In that case we can make use of the
uniq
andsort
functions in the intarray extension:db<>fiddle here
Bear in mind that this is unlikely to perform well on millions of rows.