Postgresql – Grouping on any one of multiple columns in Postgres

aggregatepostgresqlpostgresql-9.5

Is it possible to create some sort of chain of grouping in Postgres? Let's say I have the following chart:

CREATE TABLE foo AS
SELECT row_number() OVER () AS id, *
FROM ( VALUES
  ( 'X', 'D', 'G', 'P' ),
  ( 'F', 'D', 'L', 'M' ),
  ( 'X', 'N', 'R', 'S' ),
  ( 'Y', 'I', 'W', NULL ),
  ( 'U', 'Z', 'E', NULL )
) AS f(a,b,c,d);

id | a | b | c | d
------------------
 1 | X | D | G | P
 2 | F | D | L | M
 3 | X | N | R | S
 4 | Y | I | W | 
 5 | U | Z | E |

I want to somehow craft a GROUP BY that yields three groups:

1, 2 and 3 together
- 1 and 2 because of a common D in the b column
- 1 and 3 because of a common X in the a column
4 alone (no common values in any of the columns; nulls shouldn't match)
5 alone (no common values in any of the columns; nulls shouldn't match)

I'm currently using Postgres 9.5, but we'll upgrade eventually to 9.6, so if there's anything in there that'll help me, I'm open to hearing it.

In other words, I'm looking for something like (let's say I used array_agg(DISTINCT a), etc. to keep display simpler):

   ids    |     as     |     bs     |       cs        |      ds
-----------------------------------------------------------------------
{1, 2, 3} | {'X', 'F'} | {'D', 'N'} | {'G', 'L', 'R'} | {'P', 'M', 'S'}
{4}       | {'Y'}      | {'I'}      | {'W'}           | {NULL}
{5}       | {'U'}      | {'Z'}      | {'E'}           | {NULL}

(I'm not exactly sure how the nulls would display so don't get too hung up on that; the important point is that they shouldn't match each other.)

When I use GROUP BY CUBE (a, b, c, d), I get way more than three results…ditto GROUP BY ROLLUP and GROUP BY GROUPING SETS.

Is there an elegant way in Postgres? I can imagine how you would do it in Ruby via Active Record (loop through every record, group it with previous grouped sets that match), but I'd like to keep this in Postgres if possible.

Best Answer

Another recursive solution that:

first creates the adjacency list of the connected graph of ids,
then finds its transitive closure (this is the recursive part)
and then groups by (once) to find the connected components that each node belongs to
and joins to the table again and group by (again) to gather the values from all nodes of each connected component.

Initial data (copied from Jack Douglas' solution):

begin;
create schema stack;
set search_path=stack;

create table foo as
select *
from (values (1,'X','D','G','P')
           , (2,'F','D','L','M')
           , (3,'X','N','R','S')
           , (4,'Y','I','W',null)
           , (5,'U','Z','E',null) ) AS f(id,a,b,c,d);

The query:

with recursive 
  al (tail, head) as                     -- adjacency list 
  ( select f.id, g.id 
    from foo as f join foo as g
      on (f.a = g.a or f.b = g.b or f.c = g.c or f.d = g.d) 
  ),
  tc (tail, head) as                     -- transitive closure
  ( select * from al
    union distinct
    select f.tail, g.head 
    from al as f join tc as g on f.head = g.tail
  ) ,
  cc (head, ids) as                      -- group once
  ( select head, array_agg(distinct tail order by tail) as ids
    from tc
    group by head
  ) 
select                                   -- group twice
    ids,
    array_agg(distinct a order by a) as a,
    array_agg(distinct b order by b) as b,
    array_agg(distinct c order by c) as c,
    array_agg(distinct d order by d) as d
from
  cc join foo on cc.head = foo.id
group by ids ;

┌─────────┬───────┬───────┬─────────┬─────────┐
│   ids   │   a   │   b   │    c    │    d    │
├─────────┼───────┼───────┼─────────┼─────────┤
│ {1,2,3} │ {F,X} │ {D,N} │ {G,L,R} │ {M,P,S} │
│ {4}     │ {Y}   │ {I}   │ {W}     │ {NULL}  │
│ {5}     │ {U}   │ {Z}   │ {E}     │ {NULL}  │
└─────────┴───────┴───────┴─────────┴─────────┘

Cleanup:

rollback;

Related Solutions

Postgresql – Is it ever a good idea to denormalize for integrity

You are getting at something that is missing from standard normalization discussions, namely the constraint dependencies. In general wider tables provide greater possibilities here than narrower tables. So my view is that the sorts of questions you are asking in fact highlight good reasons to denormalize. I would go with your first solution (the one you are leaning towards right now).

In my view good database design generally normalizes as far as possible but ensures that all columns necessary for proper data constraints are included. You can do some of this with composite foreign keys if you don't mind adding additional unique constraints on the parent tables. Leveraging data constraints an important part of database design and one should not sacrifice that for the sake of normalization that looks good in theory.

PostgreSQL – Using HAVING ANY in Aggregate Queries

You could use either:

HAVING MAX(rnaseq.fold_change) > 4

or the Postgres aggregate function bool_or:

HAVING bool_or( rnaseq.fold_change > 4 )

Best Answer

Related Solutions

Postgresql – Is it ever a good idea to denormalize for integrity

PostgreSQL – Using HAVING ANY in Aggregate Queries

Related Question