Postgresql – Merging two tables by minimal interval values

gaps-and-islands postgresql

I have two tables, each tracking a categorical value over intervals given by a Begin and an End column, such as:

Table 1
+----+-------+-----+-----------+
| ID | Begin | End | Condition |
+----+-------+-----+-----------+
|  1 |     1 |   8 |    Normal |
|  2 |     8 |  23 |  Critical |
|  3 |    23 |  30 |    Normal |
+----+-------+-----+-----------+

Table 2
+----+-------+-----+------------+
| ID | Begin | End | Supervisor |
+----+-------+-----+------------+
|  1 |     1 |  14 |       John |
|  2 |    14 |  30 |     Janice |
+----+-------+-----+------------+

These Begin and End columns represent a continuous interval in which the value is valid. In the above example, the interval would be days in a month, so according to Table 1 I'd have a normal condition from day 1 to day 8, critical from day 8 to day 23, and normal again from the 23rd to the 30th. And I know from Table 2 that John was supervising from the 1st to the 14th, and Janice took over from the 14th to the 30th.

What I want is to merge these two tables, to have both Condition and Supervisor values in the same table, with the minimal interval for each pairing of values. So, this:

Merged Table
+----+-------+-----+-----------+------------+
| ID | Begin | End | Condition | Supervisor |
+----+-------+-----+-----------+------------+
|  1 |     1 |   8 |    Normal |       John |
|  2 |     8 |  14 |  Critical |       John |
|  3 |    14 |  23 |  Critical |     Janice |
|  4 |    23 |  30 |    Normal |     Janice |
+----+-------+-----+-----------+------------+

What can be guaranteed of the tables is that:

Each table has its own "Begin" and "End" fields.
All tables cover the same span (the "Begin" value of the first row and the "End" value of the last row are the same for every table).
The intervals are sequential (the "End" value of row n is always equal to the "Begin" value of row n+1).
The value of "End" is always higher than "Begin" for a given row.

I had this worked out programmatically in Python, but I'm trying to scrap that script from my workflow and do everything directly in the DB. Ultimately I could replicate my Python function in PL/pgSQL, but I wonder if there is a more SQL-esque way of achieving this?

Best Answer

I'd approach this in three logical stages:

Expand the rows in both tables for all the days in the month and join them together.
Assign a 'group' to each series of rows where condition and supervisor don't change for a period (a "gaps and islands" problem).
Group the results.

So the solution looks like this:

create table t1(
  id serial primary key
, begin_on integer
, end_on integer
, condition text
);

insert into t1(begin_on,end_on,condition)
values (1,8,'Normal')
     , (8,23,'Critical')
     , (23,30,'Normal');

create table t2(
  id serial primary key
, begin_on integer
, end_on integer
, supervisor text
);

insert into t2(begin_on,end_on,supervisor)
values (1,14,'John')
     , (14,30,'Janice');

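-- Expand both tables to one row per day, number each run of days that
-- share the same (t1 row, t2 row) pair, then collapse each run to one row.
select min(g) as begin_on, max(g)+1 as end_on, condition, supervisor
from( select g
           , condition
           , supervisor
             -- the difference of the two row numbers is constant within a run
           , row_number() over (order by g)
             - row_number() over (partition by t1.id, t2.id order by g) as grp
      from generate_series(1,30) g  -- 1 and 30 are the span of the sample data
           join t1 on g>=t1.begin_on and g<t1.end_on
           join t2 on g>=t2.begin_on and g<t2.end_on ) z
group by condition, supervisor, grp
order by begin_on;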

begin_on | end_on | condition | supervisor
-------: | -----: | :-------- | :---------
       1 |      8 | Normal    | John      
       8 |     14 | Critical  | John      
      14 |     23 | Critical  | Janice    
      23 |     30 | Normal    | Janice
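
The query above hard-codes the span as generate_series(1,30). As a minimal sketch, assuming the t1 and t2 tables created above and relying on the guarantee that every table covers the same span, the bounds could instead be read from t1:

```sql
-- Sketch only: same query as above, with the series bounds taken from t1.
-- Assumes t1 and t2 exist as created earlier and cover the same span.
select min(g) as begin_on, max(g)+1 as end_on, condition, supervisor
from( select g
           , condition
           , supervisor
           , row_number() over (order by g)
             - row_number() over (partition by t1.id, t2.id order by g) as grp
      from generate_series( (select min(begin_on) from t1)
                          , (select max(end_on) from t1) - 1 ) g
           join t1 on g>=t1.begin_on and g<t1.end_on
           join t2 on g>=t2.begin_on and g<t2.end_on ) z
group by condition, supervisor, grp
order by begin_on;
```

The upper bound is max(end_on) - 1 because the joins treat end_on as exclusive, so the last generated day is the one just before the final "End" value.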

I am assuming this is a cut-down example of your real-world problem; otherwise I would also suggest changing the way you store the data in the first place, perhaps using date ranges instead of integers.
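
To illustrate the range-type suggestion, here is a minimal sketch that stores each interval in a single range column. The table names cond_ranges and sup_ranges and the column name valid are hypothetical, and int4range is used only to match the integer sample data; daterange should behave the same way for real dates. It relies on the built-in range operators && (overlaps) and * (intersection).

```sql
-- Hypothetical tables: one range column replaces begin_on/end_on.
create table cond_ranges(
  id serial primary key
, valid int4range not null
, condition text
);

insert into cond_ranges(valid,condition)
values (int4range(1,8),'Normal')
     , (int4range(8,23),'Critical')
     , (int4range(23,30),'Normal');

create table sup_ranges(
  id serial primary key
, valid int4range not null
, supervisor text
);

insert into sup_ranges(valid,supervisor)
values (int4range(1,14),'John')
     , (int4range(14,30),'Janice');

-- Pair every condition row with every overlapping supervisor row and
-- keep only the part of the interval that both rows cover.
select lower(c.valid * s.valid) as begin_on
     , upper(c.valid * s.valid) as end_on
     , c.condition
     , s.supervisor
from cond_ranges c
     join sup_ranges s on c.valid && s.valid
order by begin_on;
```

With this layout there is no need to expand the rows day by day. Note that this sketch does not merge neighbouring rows that happen to carry the same condition and supervisor; if that can occur in the real data, a grouping step would still be needed.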

Related Solutions

Postgresql – Single data type for imprecise date values, as allowed by ISO 8601

No, the interval type supports reduced precision but none of the other date/time types do.

Postgres allows you to roll your own with create type, but unfortunately won't allow constraints to be added to the type, which limits its usefulness in this scenario. The best I can come up with requires you to repeat check constraints on every field where the fuzzy type is used:

create type preciseness as enum('day', 'month', 'year');
create type fuzzytimestamptz as (ts timestamptz, p preciseness);
create table t( id serial primary key,
                fuzzy fuzzytimestamptz
                    check( (fuzzy).ts is not null 
                           or ((fuzzy).ts is null and (fuzzy).p is not null) ),
                    check((fuzzy).ts=date_trunc('year', (fuzzy).ts) or (fuzzy).p<'year'),
                    check((fuzzy).ts=date_trunc('month', (fuzzy).ts) or (fuzzy).p<'month'),
                    check((fuzzy).ts=date_trunc('day', (fuzzy).ts) or (fuzzy).p<'day') );

insert into t(fuzzy) values (row(date_trunc('year', current_timestamp), 'year'));
insert into t(fuzzy) values (row(date_trunc('month', current_timestamp), 'month'));
insert into t(fuzzy) values (row(date_trunc('day', current_timestamp), 'day'));

select * from t;

 id |              fuzzy
----+----------------------------------
  1 | ("2011-01-01 00:00:00+00",year)
  2 | ("2011-09-01 00:00:00+01",month)
  3 | ("2011-09-23 00:00:00+01",day)

--edit - an example equality operator:

create function fuzzytimestamptz_equality(fuzzytimestamptz, fuzzytimestamptz)
                returns boolean language plpgsql immutable as $$
begin
  return ($1.ts, $1.ts+coalesce('1 '||$1.p, '0')::interval)
         overlaps ($2.ts, $2.ts+coalesce('1 '||$2.p, '0')::interval);
end;$$;
--
create operator = ( procedure=fuzzytimestamptz_equality, 
                    leftarg=fuzzytimestamptz, 
                    rightarg=fuzzytimestamptz );

sample query:

select *, fuzzy=row(statement_timestamp(), null)::fuzzytimestamptz as equals_now,
          fuzzy=row(statement_timestamp()+'1 day'::interval, null)::fuzzytimestamptz as equals_tomorrow,
          fuzzy=row(date_trunc('month', statement_timestamp()), 'month')::fuzzytimestamptz as equals_fuzzymonth,
          fuzzy=row(date_trunc('month', statement_timestamp()+'1 month'::interval), 'month')::fuzzytimestamptz as equals_fuzzynextmonth
from t;
 id |               fuzzy                | equals_now | equals_tomorrow | equals_fuzzymonth | equals_fuzzynextmonth
----+------------------------------------+------------+-----------------+-------------------+-----------------------
  1 | ("2011-01-01 00:00:00+00",year)    | t          | t               | t                 | t
  2 | ("2011-09-01 00:00:00+01",month)   | t          | t               | t                 | f
  3 | ("2011-09-24 00:00:00+01",day)     | t          | f               | t                 | f
  4 | ("2011-09-24 11:45:23.810589+01",) | f          | f               | t                 | f

Postgresql – SQL hourly data aggregation in postgresql

select
  date_trunc('hour', t - interval '1 minute') as interv_start,
  date_trunc('hour', t - interval '1 minute') + interval '1 hour' as interv_end,
  sum(v)
from myt
group by date_trunc('hour', t - interval '1 minute')
order by interv_start;

As for the index: you could try an expression index on date_trunc('hour', t - interval '1 minute'), but I'm not sure PostgreSQL can use it.
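
A minimal sketch of such an index follows. The definition of myt is an assumption, since the original does not show it: t is taken to be timestamp without time zone and v numeric. Index expressions must be immutable, which holds here for timestamp; with timestamptz the same create index is rejected, because both the subtraction and date_trunc depend on the session time zone.

```sql
-- Assumed table definition (not given in the original).
create table myt(
  t timestamp not null
, v numeric
);

-- Expression index on the same expression the query groups by.
create index myt_hour_idx
    on myt ((date_trunc('hour', t - interval '1 minute')));

-- Inspect the plan to see whether the index is chosen.
explain
select
  date_trunc('hour', t - interval '1 minute') as interv_start,
  sum(v)
from myt
group by date_trunc('hour', t - interval '1 minute')
order by interv_start;
```

Whether the planner picks the index depends on the data and the query. For an aggregate over the whole table it may well prefer a sequential scan, so the explain output on realistic data is the thing to check.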
