Postgresql – Grouping records with consecutive start and end dates

postgresql, postgresql-9.6

We are migrating to a new record management system and I'm working through the issues with the data migration. The source and destination data are both in PostgreSQL.

In the source database we have multiple rows per vendor: one row for each authorization code, with the start and end date of that authorization.

+----------+---------------+------------+-----------+
|  Vendor  | Authorization | Start Date | End Date  | 
+----------+---------------+------------+-----------+
| Vendor 1 | #####         | 1/1/2017   | 2/28/2017 | 
| Vendor 1 | #####         | 3/1/2017   | 4/15/2017 |
| Vendor 2 | #####         | 4/16/2017  | 5/31/2017 |
| Vendor 2 | #####         | 6/1/2017   | 7/12/2017 | 
| Vendor 1 | #####         | 7/13/2017  | 9/30/2017 |
+----------+---------------+------------+-----------+

In the destination database we have one record for each period of an active vendor relationship, with a start date and an end date. (Authorization numbers and their start and end dates are tracked separately.) After migration, the example above should look like this:

+----------+------------+-----------+
|  Vendor  | Start Date | End Date  |
+----------+------------+-----------+
| Vendor 1 | 1/1/2017   | 4/15/2017 |
| Vendor 2 | 4/16/2017  | 7/12/2017 |
| Vendor 1 | 7/13/2017  | 9/30/2017 |
+----------+------------+-----------+

I think I could accomplish this by joining the table back to itself n-1 times, where n is the largest number of consecutive authorizations for the same vendor. That is probably around 10, which is doable but definitely not preferable.

Is there a more common method to group data by consecutive start and end dates?

Best Answer

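You could combine window functions with GROUP BY: flag each row that does not start the day after the same vendor's previous end date, take a running sum of those flags to number the groups, then aggregate per group.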

create table tbl (vendor text, auth text, StartDate Date, EndDate Date);

insert into tbl values
('Vendor 1', '#####', '20170101', '20170228'),
('Vendor 1', '#####', '20170301', '20170415'),
('Vendor 2', '#####', '20170416', '20170531'),
('Vendor 2', '#####', '20170601', '20170712'),
('Vendor 1', '#####', '20170713', '20170930');

5 rows affected

-- t1: rst = 1 marks a row that starts a new group, i.e. the vendor's first row
--     or a row that does not begin the day after the vendor's previous end date.
-- t2: the running sum of rst numbers the groups (grp).
select vendor, min(startdate) as startdate, max(enddate) as enddate, grp
from (
      select vendor, auth, startdate, enddate,
             sum(rst) over (order by vendor, startdate) as grp
      from (
             select vendor, auth, startdate, enddate,
                    case when coalesce(lag(enddate) over (partition by vendor order by vendor, startdate),
                                       startdate) + 1 <> startdate
                         then 1
                    end rst
             from   tbl
           ) t1
     ) t2
group by grp, vendor
order by startdate;

vendor   | startdate  | enddate    | grp
:------- | :--------- | :--------- | --:
Vendor 1 | 2017-01-01 | 2017-04-15 |   1
Vendor 2 | 2017-04-16 | 2017-07-12 |   3
Vendor 1 | 2017-07-13 | 2017-09-30 |   2
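
If it helps to check the grouping rule outside the database, here is a minimal Python sketch of the same idea, using the rows from the question. The function name merge_consecutive is made up for illustration, and the sketch assumes a vendor's date ranges never overlap, as in the sample data.

```python
from datetime import date, timedelta

# Sample rows copied from the question: (vendor, start_date, end_date).
rows = [
    ("Vendor 1", date(2017, 1, 1), date(2017, 2, 28)),
    ("Vendor 1", date(2017, 3, 1), date(2017, 4, 15)),
    ("Vendor 2", date(2017, 4, 16), date(2017, 5, 31)),
    ("Vendor 2", date(2017, 6, 1), date(2017, 7, 12)),
    ("Vendor 1", date(2017, 7, 13), date(2017, 9, 30)),
]


def merge_consecutive(rows):
    """Merge rows of one vendor whose start is the day after the previous end.

    Hypothetical helper; assumes ranges of the same vendor do not overlap.
    """
    merged = []
    last_for_vendor = {}  # vendor -> index in `merged` of that vendor's latest group
    # Sorting by (vendor, start) plays the role of the per-vendor ordering in SQL.
    for vendor, start, end in sorted(rows):
        idx = last_for_vendor.get(vendor)
        if idx is not None and merged[idx][2] + timedelta(days=1) == start:
            # Previous end date + 1 day equals this start: extend the current group.
            merged[idx] = (vendor, merged[idx][1], end)
        else:
            # First row for the vendor, or a gap: start a new group.
            last_for_vendor[vendor] = len(merged)
            merged.append((vendor, start, end))
    # Present the groups ordered by start date.
    return sorted(merged, key=lambda r: r[1])


result = merge_consecutive(rows)
for vendor, start, end in result:
    print(vendor, start, end)
```

The "start a new group" branch corresponds to the flag computed with lag() in the query, and the index kept per vendor stands in for the running sum.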

Related Solutions

Postgresql – Cumulative data and dates – the joins aren’t working

As inspired by a_horse_with_no_name, I moved the right-hand side of the join into a subquery so that the WHERE clause doesn't interfere with the join condition.

This works as intended:

WITH dates  as (SELECT min(date) as start_date,
                       max(date) as end_date
                from training_training
                where athlete_id = 1)

SELECT  distinct(d.date),
   sum(distance) OVER (ORDER BY d.date)
FROM
  (
    SELECT generate_series(start_date, end_date, interval '1 day') as date
    FROM dates
  ) d
LEFT JOIN
  (
    SELECT
      date,
      distance
    FROM training_training
    WHERE athlete_id = 1
          AND kind IN ('t', 'd', 'i', 'w')
  ) t
ON d.date = t.date

GROUP BY d.date, distance
ORDER BY d.date
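
To see why filtering inside the subquery matters, here is a rough Python sketch of the same shape: build the full calendar first, filter the trainings separately, then accumulate. The sample rows and the function name cumulative_distance are invented for illustration and are not from the original post.

```python
from datetime import date, timedelta
from itertools import accumulate

# Hypothetical sample rows for one athlete: (date, kind, distance).
trainings = [
    (date(2016, 1, 1), "t", 5.0),
    (date(2016, 1, 2), "x", 3.0),   # kind not in the wanted set
    (date(2016, 1, 3), "d", 7.5),
    (date(2016, 1, 3), "i", 2.5),
    (date(2016, 1, 5), "x", 4.0),   # kind not in the wanted set
]
WANTED_KINDS = {"t", "d", "i", "w"}


def cumulative_distance(trainings, kinds):
    """Running total of distance for every calendar day in the data's range."""
    if not trainings:
        return []
    # The calendar spans all rows, like the `dates` CTE (no kind filter there).
    start = min(day for day, _, _ in trainings)
    end = max(day for day, _, _ in trainings)
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # Filter by kind *before* matching against the calendar (the subquery `t`).
    per_day = {}
    for day, kind, distance in trainings:
        if kind in kinds:
            per_day[day] = per_day.get(day, 0.0) + distance
    # Days without a matching training contribute 0, so no day is dropped.
    totals = accumulate(per_day.get(day, 0.0) for day in days)
    return list(zip(days, totals))


series = cumulative_distance(trainings, WANTED_KINDS)
for day, total in series:
    print(day, total)
```

Days whose only trainings have an unwanted kind still appear, carrying the previous total forward. That is the behaviour the subquery form of the left join is meant to preserve.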

Postgresql – How to deal with datetime ranges in a table with separate columns for date and time

You can use the + operator: adding a date and a time yields a timestamp.

SELECT pk,ev_date,ev FROM events;

 pk |  ev_date   |    ev    
----+------------+----------
  1 | 2016-02-19 | 01:00:00
  2 | 2016-02-19 | 02:00:00
  3 | 2016-02-19 | 05:00:00
  4 | 2016-02-19 | 12:00:00
  5 | 2016-02-19 | 18:00:00
  6 | 2016-02-19 | 23:00:00
  7 | 2016-02-20 | 01:00:00
  8 | 2016-02-20 | 05:00:00
  9 | 2016-02-20 | 12:00:00
 10 | 2016-02-20 | 18:00:00
(10 rows)

SELECT pk, ev_date, ev 
FROM events 
WHERE (ev_date + ev) 
    BETWEEN ('2016-02-19 04:00:00') 
        AND ('2016-02-20 02:00:00');

 pk |  ev_date   |    ev    
----+------------+----------
  3 | 2016-02-19 | 05:00:00
  4 | 2016-02-19 | 12:00:00
  5 | 2016-02-19 | 18:00:00
  6 | 2016-02-19 | 23:00:00
  7 | 2016-02-20 | 01:00:00
(5 rows)
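
For anyone who wants to sanity-check the expected rows outside the database, the same filter can be sketched in Python, where datetime.combine plays the role of ev_date + ev. The events list below just retypes the ten sample rows shown above.

```python
from datetime import date, time, datetime

# The ten sample rows from the events table above: (pk, ev_date, ev).
events = [
    (1, date(2016, 2, 19), time(1, 0)),
    (2, date(2016, 2, 19), time(2, 0)),
    (3, date(2016, 2, 19), time(5, 0)),
    (4, date(2016, 2, 19), time(12, 0)),
    (5, date(2016, 2, 19), time(18, 0)),
    (6, date(2016, 2, 19), time(23, 0)),
    (7, date(2016, 2, 20), time(1, 0)),
    (8, date(2016, 2, 20), time(5, 0)),
    (9, date(2016, 2, 20), time(12, 0)),
    (10, date(2016, 2, 20), time(18, 0)),
]

lower = datetime(2016, 2, 19, 4, 0, 0)
upper = datetime(2016, 2, 20, 2, 0, 0)

# Combine the two columns into one timestamp, then apply an inclusive range
# check, which is what BETWEEN does.
selected = [
    pk
    for pk, ev_date, ev in events
    if lower <= datetime.combine(ev_date, ev) <= upper
]
print(selected)
```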

Don't forget to create an expression index on the combined value, as below:

CREATE INDEX events_ts_idx ON events ((ev_date + ev));
ANALYZE events;

I inserted many dummy rows; here is the result of EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT pk, ev_date, ev FROM events  WHERE (ev_date + ev) 
    BETWEEN ('2016-02-19 23:50:00') 
        AND ('2016-02-20 00:01:00');
                                                                            QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using events_ts_idx on events  (cost=0.29..8.52 rows=8 width=16) (actual time=0.014..0.029 rows=42 loops=1)
   Index Cond: (((ev_date + ev) >= '2016-02-19 23:50:00'::timestamp without time zone) AND ((ev_date + ev) <= '2016-02-20 00:01:00'::timestamp without time zone))
 Planning time: 0.082 ms
 Execution time: 0.053 ms
(4 rows)

For comparison, I created another index and tried the row-comparison form:

CREATE INDEX events_ts2_idx ON events (ev_date,ev);
ANALYZE events;

EXPLAIN ANALYZE SELECT pk, ev_date, ev FROM events  WHERE (ev_date,ev) 
    BETWEEN ('2016-02-19','23:50:00') 
        AND ('2016-02-20','0:01:00');
                             QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on events  (cost=189.50..511.36 rows=7143 width=16) (actual time=0.027..0.042 rows=42 loops=1)
   Recheck Cond: ((ROW(ev_date, ev) >= ROW('2016-02-19'::date, '23:50:00'::time without time zone)) AND (ROW(ev_date, ev) <= ROW('2016-02-20'::date, '00:01:00'::time without time zone)))
   Heap Blocks: exact=7
   ->  Bitmap Index Scan on events_ts2_idx  (cost=0.00..187.72 rows=7143 width=0) (actual time=0.019..0.019 rows=42 loops=1)
         Index Cond: ((ROW(ev_date, ev) >= ROW('2016-02-19'::date, '23:50:00'::time without time zone)) AND (ROW(ev_date, ev) <= ROW('2016-02-20'::date, '00:01:00'::time without time zone)))
 Planning time: 0.079 ms
 Execution time: 0.071 ms
(7 rows)
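
The row-comparison form orders (ev_date, ev) pairs column by column, so it should select the same rows as the combined timestamp. A small Python sketch of that equivalence follows; Python tuples also compare element by element, and the sample rows here are invented for illustration.

```python
from datetime import date, time, datetime

# Hypothetical rows around midnight: (pk, ev_date, ev).
events = [
    (1, date(2016, 2, 19), time(23, 0)),
    (2, date(2016, 2, 19), time(23, 55)),
    (3, date(2016, 2, 20), time(0, 0)),
    (4, date(2016, 2, 20), time(0, 30)),
    (5, date(2016, 2, 20), time(23, 55)),
]

lo_pair = (date(2016, 2, 19), time(23, 50))
hi_pair = (date(2016, 2, 20), time(0, 1))

# Pair comparison: the date decides first, the time only breaks ties.
by_pair = [pk for pk, d, t in events if lo_pair <= (d, t) <= hi_pair]

# Combined comparison: build one timestamp per row and per bound.
by_timestamp = [
    pk
    for pk, d, t in events
    if datetime.combine(*lo_pair) <= datetime.combine(d, t) <= datetime.combine(*hi_pair)
]

print(by_pair, by_timestamp)
```

Both forms pick the same rows, so the choice between them comes down to which index the planner can use well.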

In my tests, the + operator approach performed better. I recommend comparing both approaches on your own machine.
