Postgresql – Grouping rows on field A or field B

postgresql

I have data structured as follows:

create table T 
( field_1 char(1) not null
, field_2 int not null
,    primary key(field_1, field_2)
);

insert into T (field_1, field_2)
values ('A',1)
      ,('A',2)
      ,('B',3)
      ,('C',3)
      ,('D',4)   
      ,('E',4)
      ,('E',5)
      ,('F',6)
      ,('G',1);

I would like to obtain sets that have either field_1 or field_2 in common, like this:

field_1     | field_2   | set
-------------------------------
A           | 1         | one
A           | 2         | one
G           | 1         | one
-------------------------------
B           | 3         | two
C           | 3         | two
-------------------------------
D           | 4         | three
E           | 4         | three
E           | 5         | three
-------------------------------
F           | 6         | four

Basically, rows having common values in field_1 OR field_2 should be assigned to the same ID. I feel like this is a very basic operation but for some reason I cannot wrap my head around the proper way to do it.

Any clues?

Best Answer

Here's one attempt that I'm unsatisfied with for several reasons, but it's a start. The idea is to calculate the transitive closure of the graph and then use the minimum field_2 for each field_1 as the group. The latter uses the fact that for two nodes either share the same closure, or the closures are disjoint. As an example:

{(a,1),(a,2),(b,2),(b,3)}

is an impossible transitive closure since if a and b share 2, they must also share 1 and 3:

I used Db2 (which does not support the keyword recursive in CTE as well as ansi join in the CTE)

CREATE TABLE t (x char(1) not null, y int not null, primary key (x,y));
INSERT INTO t (x,y)
VALUES ('A',1) 
      ,('A',2)
      ,('B',3)
      ,('C',3)
      ,('D',4)   
      ,('E',4)
      ,('E',5)
      ,('F',6)
      ,('G',1);

with tc (field_1,field_2,n) as (
    select field_1,field_2,0 from t
    union all
    select t1.field_1, t3.field_2,n+1
    from tc as t1, t as t2, t as t3
    where t1.field_2 = t2.field_2
      and t1.field_1 < t2.field_1
      and t2.field_1 = t3.field_1
      and t2.field_2 < t3.field_2
      and t1.field_2 < t2.field_2
      and n <= (select count(1) from t)
), min_closure(field_1,field_2) as (
    -- remove duplicates
    select distinct field_1,field_2 from tc
)
select t.field_1, t.field_2
     , ( select min(field_2)
         from min_closure mc1
         where t.field_1 = mc1.field_1 ) as grp
from t
order by grp;

A           1           1
G           1           1
A           2           1
B           3           3
C           3           3
D           4           4
E           4           4
E           5           4
F           6           6

The group can be further adjusted, but I'll leave it like that.

Major problem is that the CTE only sees the previous iteration, so it is very likely that we will add a lot of redundant edges to the tc. For a large graph that won't be very pleasant.

Edit: attempt to minimize impact by only looking in one direction in graph

Example of a longer chain:

delete from t;
insert into t (x,y) values ('F',5),('F',6),('G',6),('G',7),('H',7),('H',8);
with tc (x,y,n) as (select x,y,0 ...)
select distinct t.x, t.y, min(mc1.y) over (partition by mc1.x) as grp 
from min_closure mc1 join t 
    on t.x = mc1.x order by 3;

F           5           5
F           6           5
G           6           5
G           7           5
H           7           5
H           8           5

EDIT: adjusted for PostgreSQL dialect:

with recursive tc (field_1,field_2,n) as (
    select field_1,field_2,0 from t
    union all 
    select t1.field_1, t3.field_2,n+1
    from tc as t1
    join t as t2
        on t1.field_2 = t2.field_2
     and t1.field_1 < t2.field_1 
    join t as t3
        on t2.field_1 = t3.field_1
       and t2.field_2 < t3.field_2 
    where t1.field_2 < t2.field_2 
      and n <= (select count(1) from t)
), min_closure(field_1,field_2) as (
    -- remove duplicates
    select distinct field_1,field_2 from tc
) 
select t.field_1, t.field_2
     , ( select min(field_2) 
         from min_closure mc1
         where t.field_1 = mc1.field_1 ) as grp
from t 
order by 1;

Related Solutions

Postgresql – Is it ever a good idea to denormalize for integrity

You are getting at something that is missing from standard normalization discussions, namely the constraint dependencies. In general wider tables provide greater possibilities here than narrower tables. So my view is that the sorts of questions you are asking in fact highlight good reasons to denormalize. I would go with your first solution (the one you are leaning towards right now).

In my view good database design generally normalizes as far as possible but ensures that all columns necessary for proper data constraints are included. You can do some of this with composite foreign keys if you don't mind adding additional unique constraints on the parent tables. Leveraging data constraints an important part of database design and one should not sacrifice that for the sake of normalization that looks good in theory.

PostgreSQL – Return Unique Column Combinations Based on Where Clause

The execution plan shown does not seem to match the big SELECT DISTINCT query because the Sort and Unique steps are missing. Anyway you are correct than when retrieving ~50% of a table, index don't help. The best strategy is a big sequential scan of the main table and only fast hardware helps with that.

For the 2nd part of the question:

How would I go about selecting only the unique combinations of adjacent columns? Is this too complicated a task to perform through a database query? Would it speed up the query?

To remove duplicate combinations of adjacent columns, the structure of the resultset should be changed so that each output row has only one couple of adjacent columns along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary since it's always the dimension for the other column plus one.

In one single query, this could be written like this:

WITH logs as (
  SELECT log_time_mapped, syslog_priority_mapped, 
     operation_mapped, message_code_mapped, protocol_mapped, 
     source_ip_mapped, destination_ip_mapped, 
     source_port_mapped, destination_port_mapped, 
     destination_service_mapped, direction_mapped, 
     connections_built_mapped, connections_torn_down_mapped, 
     hourofday_mapped, meridiem_mapped 
  FROM firewall_logs_mapped 
  WHERE operation = 'Built')
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs
UNION ALL
...etc...
SELECT DISTINCT 14,  hourofday_mapped, meridiem_mapped FROM logs
;

The first SELECT DISTINCT subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on. DISTINCT eliminates duplicates, so the client side doesn't have to do it. The UNION ALL concatenates the results without any further processing.

However it's a heavy query and it's dubious that it would be any faster than what you're already doing.

The WITH subquery is likely to gets slowly materialized on disk, so it might be interesting to compare the execution time with this other form repeating the same condition:

SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
...etc...
;

Best Answer

Related Solutions

Postgresql – Is it ever a good idea to denormalize for integrity

PostgreSQL – Return Unique Column Combinations Based on Where Clause

Related Question