SQL Server – Populating Missing Data Based on Previous Month-End Values

gaps-and-islandssql serversql-server-2008-r2

Given the following data:

create table #histories
(
    username varchar(10),
    account varchar(10),
    assigned date  
);

insert into #histories 
values 
('PHIL','ACCOUNT1','2017-01-04'),
('PETER','ACCOUNT1','2017-01-15'),
('DAVE','ACCOUNT1','2017-03-04'),
('ANDY','ACCOUNT1','2017-05-06'),
('DAVE','ACCOUNT1','2017-05-07'),
('FRED','ACCOUNT1','2017-05-08'),
('JAMES','ACCOUNT1','2017-08-05'),
('DAVE','ACCOUNT2','2017-01-02'),
('PHIL','ACCOUNT2','2017-01-18'),
('JOSH','ACCOUNT2','2017-04-08'),
('JAMES','ACCOUNT2','2017-04-09'),
('DAVE','ACCOUNT2','2017-05-06'),
('PHIL','ACCOUNT2','2017-05-07') ; 

… which represents when a given user was assigned to an account.

I am looking to establish who owned a given account on the last day of each month (the assigned date is the date that the account transferred ownership), with any missing month-ends populated (possibly created from a handy dates table that I have available, with useful columns DateKey, Date and LastDayOfMonth, [courtesy of @AaronBertrand])1.

The desired results would be:

PETER, ACCOUNT1, 2017-01-31
PETER, ACCOUNT1, 2017-02-28
DAVE, ACCOUNT1, 2017-03-31
DAVE, ACCOUNT1, 2017-04-30
FRED, ACCOUNT1, 2017-05-31
FRED, ACCOUNT1, 2017-06-30
FRED, ACCOUNT1, 2017-07-31
JAMES, ACCOUNT1, 2017-08-31
PHIL, ACCOUNT2, 2017-01-31
PHIL, ACCOUNT2, 2017-02-28
PHIL, ACCOUNT2, 2017-03-31
JAMES, ACCOUNT2, 2017-04-30
PHIL, ACCOUNT2, 2017-05-31

Doing the initial part of this with a windowing function is trivial, it's adding the "missing" rows that I'm struggling with.

Best Answer

One approach to this problem is to do the following:

  1. Emulate LEAD on SQL Server 2008. You can use APPLY or a suquery for this.
  2. For rows without a next row, add one month to their account date.
  3. Join to a dimension table that contains month end dates. This eliminates all rows that don't span at least a month and adds rows to fill in the gaps as necessary.

I modified your test data a little bit to make the results deterministic. Also added an index:

create table #histories
(
    username varchar(10),
    account varchar(10),
    assigned date  
);

insert into #histories 
values 
('PHIL','ACCOUNT1','2017-01-04'),
('PETER','ACCOUNT1','2017-01-15'),
('DAVE','ACCOUNT1','2017-03-04'),
('ANDY','ACCOUNT1','2017-05-06'),
('DAVE','ACCOUNT1','2017-05-07'),
('FRED','ACCOUNT1','2017-05-08'),
('JAMES','ACCOUNT1','2017-08-05'),
('DAVE','ACCOUNT2','2017-01-02'),
('PHIL','ACCOUNT2','2017-01-18'),
('JOSH','ACCOUNT2','2017-04-08'), -- changed this date to have deterministic results
('JAMES','ACCOUNT2','2017-04-09'),
('DAVE','ACCOUNT2','2017-05-06'),
('PHIL','ACCOUNT2','2017-05-07') ;

-- make life easy
create index gotta_go_fast ON #histories (account, assigned);

Here's the laziest date dimension table of all time:

create table #date_dim_months_only (
    month_date date,
    primary key (month_date)
);

-- put 2500 month ends into table
INSERT INTO #date_dim_months_only WITH (TABLOCK)
SELECT DATEADD(DAY, -1, DATEADD(MONTH, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), '20000101'))
FROM master..spt_values;

For step 1, there are plenty of ways to emulate LEAD. Here's one method:

SELECT 
  h1.username
, h1.account
, h1.assigned
, next_date.assigned
FROM #histories h1
OUTER APPLY (
    SELECT TOP 1 h2.assigned
    FROM #histories h2
    WHERE h1.account = h2.account
    AND h1.assigned < h2.assigned
    ORDER BY h2.assigned ASC
) next_date;

For step 2, we need to change the NULL values to something else. You want to include the final month for each account, so adding one month to the starting date suffices:

ISNULL(next_date.assigned, DATEADD(MONTH, 1, h1.assigned))

For step 3, we can join to the date dimension table. The column from the dimension table is exactly the column you need for the result set:

INNER JOIN #date_dim_months_only dd ON
    dd.month_date >= h1.assigned AND
    dd.month_date < ISNULL(next_date.assigned, DATEADD(MONTH, 1, h1.assigned))

I didn't like the query that I got when I put it all together. There can be issues with join order when combining OUTER APPLY and INNER JOIN. To get the join order I wanted I rewrote it with a subquery:

SELECT 
  hist.username
, hist.account
, dd.month_date 
FROM
(
    SELECT 
      h1.username
    , h1.account
    , h1.assigned
    , ISNULL(
        (SELECT TOP 1 h2.assigned
            FROM #histories h2
            WHERE h1.account = h2.account
            AND h1.assigned < h2.assigned
            ORDER BY h2.assigned ASC
        )
        , DATEADD(MONTH, 1, h1.assigned)
    ) next_assigned
    FROM #histories h1
) hist
INNER JOIN #date_dim_months_only dd ON
    dd.month_date >= hist.assigned AND
    dd.month_date < hist.next_assigned;

I don't know how much data you have so it might not matter for you. But the plan looks how I want it to:

good plan

The results match yours:

╔══════════╦══════════╦════════════╗
║ username ║ account  ║ month_date ║
╠══════════╬══════════╬════════════╣
║ PETER    ║ ACCOUNT1 ║ 2017-01-31 ║
║ PETER    ║ ACCOUNT1 ║ 2017-02-28 ║
║ DAVE     ║ ACCOUNT1 ║ 2017-03-31 ║
║ DAVE     ║ ACCOUNT1 ║ 2017-04-30 ║
║ FRED     ║ ACCOUNT1 ║ 2017-05-31 ║
║ FRED     ║ ACCOUNT1 ║ 2017-06-30 ║
║ FRED     ║ ACCOUNT1 ║ 2017-07-31 ║
║ JAMES    ║ ACCOUNT1 ║ 2017-08-31 ║
║ PHIL     ║ ACCOUNT2 ║ 2017-01-31 ║
║ PHIL     ║ ACCOUNT2 ║ 2017-02-28 ║
║ PHIL     ║ ACCOUNT2 ║ 2017-03-31 ║
║ JAMES    ║ ACCOUNT2 ║ 2017-04-30 ║
║ PHIL     ║ ACCOUNT2 ║ 2017-05-31 ║
╚══════════╩══════════╩════════════╝