Postgresql – Merging two tables by minimal interval values

gaps-and-islandspostgresql

I have two tables with two columns keeping tabs of categorical values, such as:

Table 1
+----+-------+-----+-----------+
| ID | Begin | End | Condition |
+----+-------+-----+-----------+
|  1 |     1 |   8 |    Normal |
|  2 |     8 |  23 |  Critical |
|  3 |    23 |  30 |    Normal |
+----+-------+-----+-----------+

Table 2
+----+-------+-----+------------+
| ID | Begin | End | Supervisor |
+----+-------+-----+------------+
|  1 |     1 |  14 |       John |
|  2 |    14 |  30 |     Janice |
+----+-------+-----+------------+

These Begin and End columns represent a continuous interval in which the value is valid. In the above example, the interval would be days in a month, so according to Table 1 I'd have a normal condition from day 1 to day 8, critical from day 8 to day 23, and normal again from the 23rd to the 30th. And I know from Table 2 that John was supervising from the 1st to the 14th, and Janice took over from the 14th to the 30th.

What I want is to merge these two tables, to have both Condition and Supervisor values in the same table, with the minimal interval for each pairing of values. So, this:

Merged Table
+----+-------+-----+-----------+------------+
| ID | Begin | End | Condition | Supervisor |
+----+-------+-----+-----------+------------+
|  1 |     1 |   8 |    Normal |       John |
|  2 |     8 |  14 |  Critical |       John |
|  3 |    14 |  23 |  Critical |     Janice |
|  4 |    23 |  30 |    Normal |     Janice |
+----+-------+-----+-----------+------------+

What can be guaranteed of the tables is that:

  • each will have its respective "Begin" and "End" fields.
  • those will have the same span (the "Begin" value of the first row and the "End" value of the last row are the same for every table).
  • that every interval is sequential (the "End" value of row n is always equal to the "Begin" value of row n+1).
  • that the value of "End" will always be higher than "Begin" for a given row.

I had this worked out programatically in python, but I'm trying to scrap that script from my workflow and do everything directly in the DB. Ultimately I could replicate my python function in plpgsql, but I wonder if there is a more SQL-esque way of achieving this?

Best Answer

I'd approach this in three logical stages:

  1. Expand the rows in both tables for all the days in the month and join them together.
  2. Assign a 'group' to each series of rows where condition and supervisor don't change for a period (a "gaps and islands" problem).
  3. Group the results.

So the solution looks like this:

create table t1(
  id serial primary key
, begin_on integer
, end_on integer
, condition text
);

insert into t1(begin_on,end_on,condition)
values (1,8,'Normal')
     , (8,23,'Critical')
     , (23,30,'Normal');

create table t2(
  id serial primary key
, begin_on integer
, end_on integer
, supervisor text
);

insert into t2(begin_on,end_on,supervisor)
values (1,14,'John')
     , (14,30,'Janice');
select min(g) begin_on, max(g)+1 end_on, condition, supervisor
from( select g
           , condition
           , supervisor
           , row_number() over (order by g) 
             - row_number() over (partition by t1.id, t2.id order by g) grp
      from generate_series(1,30) g
           join t1 on g>=t1.begin_on and g<t1.end_on
           join t2 on g>=t2.begin_on and g<t2.end_on ) z
group by condition, supervisor, grp
order by begin_on;
begin_on | end_on | condition | supervisor
-------: | -----: | :-------- | :---------
       1 |      8 | Normal    | John      
       8 |     14 | Critical  | John      
      14 |     23 | Critical  | Janice    
      23 |     30 | Normal    | Janice    

dbfiddle here

I am assuming this is a cut-down example of your real-world problem or I would also suggest changing the way you store the data in the first place, perhaps using date ranges instead of integers.