Postgresql – Grouping records with with consecutive start and end dates -Postgresql

postgresqlpostgresql-9.6

We are migrating to a new record management system and I'm weeding through the issues with the data migration. Source data and destination data are all in Postgresql

In the source database we have multiple rows for the same vendor for each authorization code for that vendor and the start and end date of that authorization.

+----------+---------------+------------+-----------+
|  Vendor  | Authorization | Start Date | End Date  | 
+----------+---------------+------------+-----------+
| Vendor 1 | #####         | 1/1/2017   | 2/28/2017 | 
| Vendor 1 | #####         | 3/1/2017   | 4/15/2017 |
| Vendor 2 | #####         | 4/16/2017  | 5/31/2017 |
| Vendor 2 | #####         | 6/1/2017   | 7/12/2017 | 
| Vendor 1 | #####         | 7/13/2017  | 9/30/2017 |
+----------+---------------+------------+-----------+

In the destination database we have records for the active vendor relationship at a given time with a start date and an end date. (Auth # start and end dates are tracked separately) After migration the example listed previously should look like this.

+----------+------------+-----------+
|  Vendor  | Start Date | End Date  |
+----------+------------+-----------+
| Vendor 1 | 1/1/2017   | 4/15/2017 |
| Vendor 2 | 4/16/2017  | 7/12/2017 |
| Vendor 1 | 7/13/2017  | 9/30/2017 |
+----------+------------+-----------+

I think I should be able to accomplish this joining the table back on it self n-1 number of times where n is the largest number of consecutive authorization numbers for the same vendor which is probably around 10, which is doable but definitely not preferable.

Is there a more common method to group data by consecutive start and end dates?

Best Answer

You could use a GROUPING AND WINDOW solution in this way:

create table tbl (vendor text, auth text, StartDate Date, EndDate Date);
insert into tbl values
('Vendor 1', '#####', '20170101', '20170228'),
('Vendor 1', '#####', '20170301', '20170415'),
('Vendor 2', '#####', '20170416', '20170531'),
('Vendor 2', '#####', '20170601', '20170712'),
('Vendor 1', '#####', '20170703', '20170930');
5 rows affected
select vendor, min(startdate) as startdate, max(enddate) as enddate, grp
from (
      select vendor, auth, startdate, enddate, 
             sum(rst) over (order by vendor, startdate) as grp
      from (
             select vendor, auth, startdate, enddate, 
                    case when coalesce(lag(enddate) over (partition by vendor order by vendor, startdate), startdate) + 1 <> startdate then 1 end rst
             from   tbl
           ) t1
     ) t2
group by grp, vendor
order by startdate
vendor   | startdate  | enddate    | grp
:------- | :--------- | :--------- | --:
Vendor 1 | 2017-01-01 | 2017-04-15 |   1
Vendor 2 | 2017-04-16 | 2017-07-12 |   3
Vendor 1 | 2017-07-03 | 2017-09-30 |   2

db<>fiddle here