Sql-server – Total of orders open on a given date for each date in a date range

sql-server-2008t-sql

This question is similar to Running total with count?, but please allow me to explain some further twists on the issue.

I'm using SQL Server 2008, so the cursor option described by Aaron Bertrand seems to be the most promising in terms of speed.

What's different here, though, is I have to take into account two dates for a single item. So, given an OrderID item within a table of Orders, the OrderID has an Opened Date and a Closed Date. The Opened Date is always populated, but the Closed Date could be NULL.

OrderID  OpenedDate    ClosedDate
654554    12/1/2011     5/4/2012
678451    12/4/2011     3/2/2012
679565    12/8/2011     5/21/2012
701541    5/23/2012     NULL
...

I need to – efficiently – get back how many Orders had what we could term an "Open" status on any given date within a date range. The date range could span a couple of years.
(Yes, I do have a reference table of sequential dates.)

Date         CountOfOpenOrders
12/1/2011    175
12/2/2011    178
12/3/2011    195
12/4/2011    192
12/5/2011    191
...

Best Answer

If your priority is speed of selects, the following approach allows for extremely fast selects. Instead of storing a period, you want to store two events (period start and period end). Change column is 1 when the period begins, and is -1 when the period ends. If more than one event occurs on the same day, they must have different EventNumberPerDay. RunningTotal is the number of open peroids after the event has happened:

CREATE TABLE dbo.Events(
 PeriodId INT NOT NULL,
 Change SMALLINT NOT NULL,
 ChangeDate DATE NOT NULL,
 EventNumberPerDay INT NOT NULL,
 RunningTotal INT NOT NULL);
GO

INSERT dbo.Events
        ( PeriodId ,
          Change ,
          ChangeDate ,
          EventNumberPerDay,
          RunningTotal
        )
-- first period begins    
VALUES  ( 1 , 1, '20120801', 1, 1),
-- second period begins on the same day
(2, 1, '20120801', 2, 2),
-- third period begins
(3,1,'20120803',1, 3),
-- second period ends on the same day
(2,-1,'20120803',2, 2),
-- fourth period begins
(4,1,'20120804',1,3),
-- fourth period ends
(4,-1,'20120805',1,2),
-- first period ends
(1, -1, '20120808',1,  1),
-- third period ends
(3, -1, '20120809',1,  0);
GO

Also you need a calendar table:

CREATE TABLE dbo.Calendar([Date] DATE NOT NULL);
GO

INSERT INTO dbo.Calendar([Date])
VALUES('20120801'),
('20120802'),
('20120803'),
('20120804'),
('20120805'),
('20120806'),
('20120807'),
('20120808'),
('20120809'),
('20120810'),
('20120811');

Once that is accomplished, then your select is very simple and very fast:

SELECT  [Date] ,
        coalesce(RunningTotal, 0) AS NumOpenIntervals
FROM    dbo.Calendar
        OUTER APPLY ( SELECT TOP ( 1 )
                                RunningTotal
                      FROM      dbo.Events
                      WHERE     ChangeDate <= [Date]
                      ORDER BY  ChangeDate DESC, EventNumberPerDay DESC
                    ) AS t
ORDER BY [Date]

Of course, this query is only correct if the data in Events table is valid. We can use constraints to ensure 100% data integrity. I can explain how if you are interested.

Another alternative is to just load your raw data, your periods, into a client application - your problem is absolutely trivial in C++/C#/Java.

Yet another approach is to use an RDBMS with fast cursors such as Oracle - that will allow you to just write a simple cursor and enjoy good performance, but still not always as good as my first solution.