Sql-server – Grouping Subsets of Rows with Null Values within an Ordered Set

gaps-and-islands, group by, sql server, sql-server-2008-r2

Let's say we have a table where each row is a day, ordered by the day column. We have then left joined a membership data set showing on which days the member was active (and not).

Let's say our current data set looks like this: membership was inactive on days 1-2, active on days 3-5, inactive on days 6-8, and active from day 9 onward.

DAY     DATE        MEMBER  ACTIVE
 1      2017-01-01  123     null
 2      2017-01-02  123     null
 3      2017-01-03  123     2017-01-03
 4      2017-01-04  123     2017-01-04
 5      2017-01-05  123     2017-01-05
 6      2017-01-06  123     null
 7      2017-01-07  123     null
 8      2017-01-08  123     null
 9      2017-01-09  123     2017-01-09
10      2017-01-10  123     2017-01-10

…so ACTIVE=null means membership was not active on those days.

With this data structure, I would like to get to a "collapsed" set, showing "spans" of time inactive/active:

MEMBER  MIN(DATE)   MAX(DATE)   STATUS
123     2017-01-01  2017-01-02  INACTIVE
123     2017-01-03  2017-01-05  ACTIVE
123     2017-01-06  2017-01-08  INACTIVE
123     2017-01-09  2017-01-10  ACTIVE

I have tried using row_number() to partition out the subsets with a given status, but min()/max() over the rows where ACTIVE is null treats them all as a single group, when in reality there are several distinct spans of inactive membership.

How can I distinguish the spans of inactive membership from one another for grouping purposes?
What technique can I use to achieve that output above?

Here is the script to generate the dummy source data:

CREATE TABLE ##SRC (ID INT, D DATE, MEMBER INT, ACTIVE DATE);

INSERT INTO ##SRC (ID, D, MEMBER, ACTIVE)
SELECT 1, '2017-01-01', 123, NULL UNION 
SELECT 2, '2017-01-02', 123, NULL UNION 
SELECT 3, '2017-01-03', 123, '2017-01-03' UNION 
SELECT 4, '2017-01-04', 123, '2017-01-04' UNION 
SELECT 5, '2017-01-05', 123, '2017-01-05' UNION 
SELECT 6, '2017-01-06', 123, NULL UNION 
SELECT 7, '2017-01-07', 123, NULL UNION 
SELECT 8, '2017-01-08', 123, NULL UNION 
SELECT 9, '2017-01-09', 123, '2017-01-09' UNION 
SELECT 10, '2017-01-10',    123, '2017-01-10' 
;

Best Answer

Your sample data does not match your description, which confused me at first. As sp_BlitzErik points out, this is a gaps-and-islands problem. The solution is pretty straightforward if you have access to window functions. First, we enumerate the rows per member alone; let's call this fullorder (it happens to be the same as day here, but I'll add it for generality). Second, we enumerate the rows per member and per whether the member was active on that day; let's call this partialorder:

 select day, active, date, member
      , row_number() over (partition by member 
                           order by day) as fullorder
      , row_number() over (partition by member
                          ,case when active is null then 0 else 1 end
                          order by day) as partialorder
 from src

DAY         ACTIVE     MEMBER      FULLORDER            PARTIALORDER
----------- ---------- ----------- -------------------- --------------------
          1 -                  123                    1                    1
          2 -                  123                    2                    2
          3 01/03/2017         123                    3                    1
          4 01/04/2017         123                    4                    2
          5 01/05/2017         123                    5                    3
          6 -                  123                    6                    3
          7 -                  123                    7                    4
          8 -                  123                    8                    5
          9 01/09/2017         123                    9                    4
         10 01/10/2017         123                   10                    5

If the difference between fullorder and partialorder changes, it means that active has changed from null to a value, or vice versa; consecutive rows with the same status share the same difference. Therefore we can form a group from this difference. Within each such group we can pick min(date) and max(active) to form an interval:

select member, grp, min(date), max(active) 
from (
    select day, active, date, member
         , row_number() over (partition by member order by day) 
         - row_number() over (partition by member
                             ,case when active is null then 0 else 1 end 
                              order by day) as grp  
    from src
) as t  -- SQL Server requires an alias on a derived table
group by member, grp

MEMBER      GRP                  3          4
----------- -------------------- ---------- ----------
        123                    0 01/01/2017 -
        123                    2 01/03/2017 01/05/2017
        123                    3 01/05/2017 -
        123                    5 01/08/2017 01/10/2017
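
To see why the difference works as a group key, here is a minimal sketch in Python (my addition, not part of the SQL above) that mimics the two row_number() calls on the ten sample rows for member 123 from the question. The names active and island_keys are made up for the illustration.

```python
from itertools import groupby

# ACTIVE column for days 1..10 of member 123, as in the question's sample data:
# None stands for NULL (inactive), True for a non-null ACTIVE date.
active = [None, None, True, True, True, None, None, None, True, True]

def island_keys(flags):
    """Return (status, fullorder - partialorder) for each row."""
    counters = {0: 0, 1: 0}      # one running row number per status partition
    keys = []
    for fullorder, flag in enumerate(flags, start=1):
        status = 0 if flag is None else 1
        counters[status] += 1    # partialorder within this status
        keys.append((status, fullorder - counters[status]))
    return keys

keys = island_keys(active)
grps = [diff for _, diff in keys]
print(grps)   # [0, 0, 2, 2, 2, 3, 3, 3, 5, 5]

# Consecutive rows with the same key form one span: (key, number of rows).
spans = [(key, len(list(rows))) for key, rows in groupby(keys)]
print(spans)  # [((0, 0), 2), ((1, 2), 3), ((0, 3), 3), ((1, 5), 2)]
```

The differences come out as 0, 2, 3, 5, matching the GRP column above. The sketch keeps the status next to the difference because the difference alone is not always unique: for a sequence active, inactive, active the differences are 0, 1, 1, so an inactive row and the following active row would share a value.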

It's probably easiest to add another level of nesting to get the desired result:

select member, min_active
     , coalesce(max_active, min_active) as max_active
     , case when max_active is null then 'INACTIVE' else 'ACTIVE' end as status 
from (
    select member, grp, min(date) as min_active, max(active) as max_active 
    from (
        select day, active, date, member
             , row_number() over (partition by member order by day) 
             - row_number() over (partition by member
                                 ,case when active is null then 0 else 1 end 
                                 order by day) as grp  
        from src
    ) as t
    group by member, grp) as s

MEMBER      MIN_ACTIVE MAX_ACTIVE STATUS
----------- ---------- ---------- --------
        123 01/01/2017 01/01/2017 INACTIVE
        123 01/03/2017 01/05/2017 ACTIVE
        123 01/05/2017 01/05/2017 INACTIVE
        123 01/08/2017 01/10/2017 ACTIVE
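
Note that in the query above max(active) is null for an inactive span, so the coalesce falls back to the span's first date and inactive spans are reported with their start date only. A possible variant, sketched below under some assumptions, takes min and max over the date column instead and also groups by the active/inactive flag. To keep it self-contained and runnable it uses Python's sqlite3 module with an in-memory table as a stand-in for SQL Server (this needs SQLite 3.25 or later for window functions). The column names id and d follow the question's script; is_active, span_start and span_end are names I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INT, d TEXT, member INT, active TEXT)")

# Same ten rows as the question's script; None is stored as NULL.
active_days = {3, 4, 5, 9, 10}
rows = [
    (i, f"2017-01-{i:02d}", 123, f"2017-01-{i:02d}" if i in active_days else None)
    for i in range(1, 11)
]
conn.executemany("INSERT INTO src VALUES (?, ?, ?, ?)", rows)

query = """
SELECT member,
       MIN(d) AS span_start,
       MAX(d) AS span_end,
       CASE WHEN is_active = 1 THEN 'ACTIVE' ELSE 'INACTIVE' END AS status
FROM (
    SELECT member, d,
           CASE WHEN active IS NULL THEN 0 ELSE 1 END AS is_active,
           ROW_NUMBER() OVER (PARTITION BY member ORDER BY id)
         - ROW_NUMBER() OVER (PARTITION BY member,
                                           CASE WHEN active IS NULL THEN 0 ELSE 1 END
                              ORDER BY id) AS grp
    FROM src
) AS t
GROUP BY member, is_active, grp   -- flag + difference identifies one span
ORDER BY member, span_start
"""

spans = conn.execute(query).fetchall()
for span in spans:
    print(span)
```

On the question's sample rows this returns the four spans from the desired output (2017-01-01 to 2017-01-02 INACTIVE, 2017-01-03 to 2017-01-05 ACTIVE, and so on). The SQL text only uses ROW_NUMBER, CASE and a derived table, so I would expect it to carry over to SQL Server with the table name changed, but I have only reasoned about it against SQLite here.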

Related Solutions

Sql-server – Finding rows with duplicate values

This should tag the records that need attention. I put the tagging in the SELECT, but you could easily turn it into a second CTE and simply select out the payments to clean up.

-- 
-- find all accounts with more than one payment and mark payments to cancel
--
WITH cte_DuplicatePayments AS
(
SELECT COUNT(*) OVER(PARTITION BY accountID) AS numberOfPaymentsPerAccountID
, COUNT(*) OVER(partition BY accountID, amount) AS numberOfPaymentsPerAccountIDAndAmount
, ROW_NUMBER() OVER(partition BY accountID ORDER BY amount asc) AS PaymentsNumberPerAccountID
, *
FROM ScheduledPayment
)
SELECT CASE 
    WHEN numberOfPaymentsPerAccountID != numberOfPaymentsPerAccountIDAndAmount THEN 'MARK AS CANCELLED: Duplicate Payments with amount mismatch' 
    WHEN PaymentsNumberPerAccountID > 1 THEN 'MARK AS CANCELLED: Duplicate Payments with matching amount' 
    ELSE ''
   END AS PaymentAuditAction
, ScheduledPaymentID, accountID, amount
FROM cte_DuplicatePayments
WHERE numberOfPaymentsPerAccountID > 1
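
As a rough illustration of how the three window columns drive the tagging, here is a small sketch that runs a shortened form of the query through Python's sqlite3 module (SQLite 3.25 or later). The table layout and the five rows are invented for the example, the action labels are abbreviated, and ORDER BY amount, ScheduledPaymentID adds a tie-breaker that the query above does not have, so that the payment kept within a set of equal amounts is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ScheduledPayment (ScheduledPaymentID INT, accountID INT, amount INT)"
)
# Hypothetical rows: account 1 has two equal payments, account 2 has two
# different amounts, account 3 has a single payment.
conn.executemany(
    "INSERT INTO ScheduledPayment VALUES (?, ?, ?)",
    [(1, 1, 50), (2, 1, 50), (3, 2, 20), (4, 2, 30), (5, 3, 10)],
)

query = """
WITH cte AS (
    SELECT COUNT(*) OVER (PARTITION BY accountID) AS per_account,
           COUNT(*) OVER (PARTITION BY accountID, amount) AS per_account_amount,
           ROW_NUMBER() OVER (PARTITION BY accountID
                              ORDER BY amount, ScheduledPaymentID) AS rn,
           ScheduledPaymentID
    FROM ScheduledPayment
)
SELECT ScheduledPaymentID,
       CASE WHEN per_account != per_account_amount THEN 'amount mismatch'
            WHEN rn > 1 THEN 'duplicate'
            ELSE '' END AS action
FROM cte
WHERE per_account > 1          -- accounts with a single payment are skipped
ORDER BY ScheduledPaymentID
"""

tagged = conn.execute(query).fetchall()
for row in tagged:
    print(row)
```

Payment 1 is kept, payment 2 is tagged as a duplicate, and both payments of account 2 are tagged as an amount mismatch; account 3 does not appear at all.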

Sql-server – Updating table based on conditions (overlapping dates)

Condition 1:

WITH ord as (
    SELECT ID, CustomerID, CheckInDate, CheckOutDate
        , n = ROW_NUMBER() over(partition by [CustomerID] order by [CheckInDate], [CheckOutDate])
    FROM @data d1
), first as (
    SELECT o1.ID, o1.CustomerID, o1.CheckInDate, o1.CheckOutDate, o1.n
        , m = ROW_NUMBER() over(partition by o1.[CustomerID] order by o1.[CheckInDate], o1.[CheckOutDate])
    FROM ord o1
    INNER JOIN ord o2 ON o1.CustomerID = o2.CustomerID AND o2.n+1 = o1.n AND o1.CheckInDate > o2.CheckOutDate
), groups as (
    SELECT o.ID, o.CustomerID, nx = MIN(coalesce(f.n, 1)), n = MAX(o.n)
        , p = ROW_NUMBER() over(partition by o.CustomerID, MIN(coalesce(f.n, 1)) ORDER BY o.ID)
    FROM ord o
    LEFT JOIN first f ON o.CustomerID = f.CustomerID AND o.n < f.n
    GROUP BY o.ID, o.CustomerID
), dates as (
    SELECT g.CustomerID, g.nx, CheckInDate = MIN(o.CheckInDate)
        , CheckOutDate = CASE WHEN SUM(CASE WHEN o.CheckOutDate IS NULL THEN 1 END) IS NULL THEN MAX(o.CheckOutDate) END
    FROM groups g
    INNER JOIN ord o ON g.ID = o.ID
    GROUP BY g.nx, g.CustomerID
    HAVING COUNT(g.nx) > 1
)
SELECT o.ID, o.CustomerID
    , CheckInDate = CASE WHEN g.p = 1 THEN d.CheckInDate END
    , CheckOutDate = CASE WHEN g.p = 1 THEN d.CheckOutDate END
FROM groups g
INNER JOIN ord o ON g.ID = o.ID
INNER JOIN dates d on g.CustomerID = d.CustomerID AND g.nx = d.nx
ORDER BY ID

This query outputs the rows that must be updated:

ord = I first partition by CustomerID and order by CheckInDate
first = I join consecutive rows from the previous CTE where the next CheckInDate does not overlap the previous CheckInDate and CheckOutDate, and I partition and order them
groups = I group by the previous partitioned number in order to know which group a row belongs to
dates = I join with the original data in order to get the first and last date for each group. Groups with only 1 row are removed
main select = outputs the dates for p=1 and NULL otherwise

Output:

ID  CustomerID  CheckInDate CheckOutDate
1   1           2015-03-04  NULL
3   1           NULL        NULL
4   1           NULL        NULL

Condition 2:

WITH ord as (
    SELECT ID, CustomerID, CheckInDate, CheckOutDate
        , n = ROW_NUMBER() over(partition by [CustomerID] order by [CheckInDate], [CheckOutDate])
    FROM @data d1
), first as (
    SELECT o1.ID, o1.CustomerID, o1.CheckInDate, o1.CheckOutDate, o1.n
        , m = ROW_NUMBER() over(partition by o1.[CustomerID] order by o1.[CheckInDate], o1.[CheckOutDate])
    FROM ord o1
    INNER JOIN ord o2 ON o1.CustomerID = o2.CustomerID AND o2.n+1 = o1.n AND o1.CheckInDate > o2.CheckOutDate
), groups as (
    SELECT o.ID, o.CustomerID, nx = MIN(coalesce(f.n, 1)), n = MAX(o.n)
        , p = ROW_NUMBER() over(partition by o.CustomerID, MIN(coalesce(f.n, 1)) ORDER BY o.ID)
        , last = ROW_NUMBER() over(partition by o.CustomerID, MIN(coalesce(f.n, 1)) ORDER BY o.ID DESC)
    FROM ord o
    LEFT JOIN first f ON o.CustomerID = f.CustomerID AND o.n < f.n
    GROUP BY o.ID, o.CustomerID
), dates as (
    SELECT g.CustomerID, g.nx, CheckInDate = MIN(o.CheckInDate)
        , CheckOutDate = MAX(o2.CheckOutDate)
    FROM groups g
    INNER JOIN ord o ON g.ID = o.ID
    INNER JOIN (SELECT ID, CustomerID, nx FROM groups WHERE last = 1) l ON g.CustomerID = l.CustomerID AND g.nx = l.nx
    INNER JOIN ord o2 ON l.ID = o2.ID
    GROUP BY g.nx, g.CustomerID
    HAVING COUNT(g.nx) > 1
)
SELECT o.ID, o.CustomerID
    , CheckInDate = CASE WHEN g.p = 1 THEN d.CheckInDate END
    , CheckOutDate = CASE WHEN g.p = 1 THEN d.CheckOutDate END
FROM groups g
INNER JOIN ord o ON g.ID = o.ID
INNER JOIN dates d on g.CustomerID = d.CustomerID AND g.nx = d.nx
ORDER BY ID

Output:

ID  CustomerID  CheckInDate CheckOutDate
1   1           2015-03-04  2015-05-03
3   1           NULL        NULL
4   1           NULL        NULL

For Updates, replace SELECT by UPDATE:

UPDATE g SET 
    CheckInDate = CASE WHEN g.p = 1 THEN d.CheckInDate END
    , CheckOutDate = CASE WHEN g.p = 1 THEN d.CheckOutDate END
FROM ...

Your data:

declare @data table([ID] int, [CustomerID] int, [CheckInDate] date, [CheckOutDate] date);
Insert into @data([ID], [CustomerID], [CheckInDate], [CheckOutDate])
VALUES
    (1, 1, '2015-04-02', '2015-04-05'),
    (2, 2, '2015-03-04', '2015-05-02'),
    (3, 1, '2015-04-01', NULL),
    (4, 1, '2015-03-04', '2015-05-03'),
    (5, 1, '2015-01-03', '2015-02-03')
;

It works as well with this sample:

(1, 1, '2015-04-02', '2015-04-05'),
(2, 2, '2015-03-04', '2015-05-02'),
(3, 1, '2015-04-01', NULL),
(4, 1, '2015-03-04', '2015-05-03'),
(5, 1, '2015-01-03', '2015-02-03'),
(6, 1, '2015-01-02', '2015-02-03'),
(7, 1, '2015-03-04', '2015-03-06'),
(8, 1, '2015-03-04', '2015-05-06'),
(9, 1, '2014-04-02', '2014-04-05'),
(10, 1, '2014-03-04', '2014-05-02')

If it does not work with some of your data, update the Input and Output tables in your question with more relevant values.
