Sql-server – Calculate Total Visits

gaps-and-islandssql serversql-server-2008-r2

I am trying to write a query where I have to calculate number of visits for a customer by taking care of overlapping days. Suppose for itemID 2009 start date is 23rd and end date is 26th therefore item 20010 is between these days we will not add this purchase date to our total count.

Example Scenario:

Item ID Start Date   End Date   Number of days     Number of days Candidate for visit count
20009   2015-01-23  2015-01-26     4                      4
20010   2015-01-24  2015-01-24     1                      0
20011   2015-01-23  2015-01-26     4                      0
20012   2015-01-23  2015-01-27     5                      1
20013   2015-01-23  2015-01-27     5                      0
20014   2015-01-29  2015-01-30     2                      2

OutPut should be 7 VisitDays

Input Table:

CREATE TABLE #Items    
(
CustID INT,
ItemID INT,
StartDate DATETIME,
EndDate DATETIME
)           


INSERT INTO #Items
SELECT 11205, 20009, '2015-01-23',  '2015-01-26'  
UNION ALL 
SELECT 11205, 20010, '2015-01-24',  '2015-01-24'    
UNION ALL  
SELECT 11205, 20011, '2015-01-23',  '2015-01-26' 
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'  
UNION ALL  
SELECT 11205, 20012, '2015-01-23',  '2015-01-27'   
UNION ALL  
SELECT 11205, 20012, '2015-01-28',  '2015-01-29'

I have tried so far:

CREATE TABLE #VisitsTable
    (
      StartDate DATETIME,
      EndDate DATETIME
    )

INSERT  INTO #VisitsTable
        SELECT DISTINCT
                StartDate,
                EndDate
        FROM    #Items items
        WHERE   CustID = 11205
        ORDER BY StartDate ASC

IF EXISTS (SELECT TOP 1 1 FROM #VisitsTable) 
BEGIN 


SELECT  ISNULL(SUM(VisitDays),1)
FROM    ( SELECT DISTINCT
                    abc.StartDate,
                    abc.EndDate,
                    DATEDIFF(DD, abc.StartDate, abc.EndDate) + 1 VisitDays
          FROM      #VisitsTable abc
                    INNER JOIN #VisitsTable bc ON bc.StartDate NOT BETWEEN abc.StartDate AND abc.EndDate      
        ) Visits

END



--DROP TABLE #Items 
--DROP TABLE #VisitsTable

Best Answer

This first query creates different Start Date and End Date ranges with no overlaps.

Note:

Your sample(id=0) is mixed with a sample from Ypercube (id=1)
This solution may not scale well with huge amount of data for each id or huge number of id. This has the advantage of not requiring a number table. With large dataset, a number table will very likely give better performances.

Query:

SELECT DISTINCT its.id
    , Start_Date = its.Start_Date 
    , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
    --, x1=itmax.End_Date, x2=itmin.Start_Date, x3=its.End_Date
FROM @Items its
OUTER APPLY (
    SELECT Start_Date = MAX(End_Date) FROM @Items std
    WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
) itmin
OUTER APPLY (
    SELECT End_Date = MIN(Start_Date) FROM @Items std
    WHERE std.Item_ID <> its.Item_ID+1000 AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
) itmax;

Output:

id  | Start_Date                    | End_Date                      
0   | 2015-01-23 00:00:00.0000000   | 2015-01-23 00:00:00.0000000   => 1
0   | 2015-01-24 00:00:00.0000000   | 2015-01-27 00:00:00.0000000   => 4
0   | 2015-01-29 00:00:00.0000000   | 2015-01-30 00:00:00.0000000   => 2
1   | 2016-01-20 00:00:00.0000000   | 2016-01-22 00:00:00.0000000   => 3
1   | 2016-01-23 00:00:00.0000000   | 2016-01-24 00:00:00.0000000   => 2
1   | 2016-01-25 00:00:00.0000000   | 2016-01-29 00:00:00.0000000   => 5

If you use these Start Date and End Date with DATEDIFF:

SELECT DATEDIFF(day
    , its.Start_Date 
    , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
) + 1
...

Output (with duplicates) is:

1, 4 and 2 for id 0 (your sample => SUM=7)
3, 2 and 5 for id 1 (Ypercube sample => SUM=10)

You then only need to put everything together with a SUM and GROUP BY:

SELECT id 
    , Days = SUM(
        DATEDIFF(day, Start_Date, End_Date)+1
    )
FROM (
    SELECT DISTINCT its.id
         , Start_Date = its.Start_Date 
        , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
    FROM @Items its
    OUTER APPLY (
        SELECT Start_Date = MAX(End_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
    ) itmin
    OUTER APPLY (
        SELECT End_Date = MIN(Start_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
    ) itmax
) as d
GROUP BY id;

Output:

id  Days
0   7
1   10

Data used with 2 different ids:

INSERT INTO @Items
    (id, Item_ID, Start_Date, End_Date)
VALUES 
    (0, 20009, '2015-01-23', '2015-01-26'),
    (0, 20010, '2015-01-24', '2015-01-24'),
    (0, 20011, '2015-01-23', '2015-01-26'),
    (0, 20012, '2015-01-23', '2015-01-27'),
    (0, 20013, '2015-01-23', '2015-01-27'),
    (0, 20014, '2015-01-29', '2015-01-30'),

    (1, 20009, '2016-01-20', '2016-01-24'),
    (1, 20010, '2016-01-23', '2016-01-26'),
    (1, 20011, '2016-01-25', '2016-01-29')

Related Solutions

Sql-server – the best way to get all data for a date range, plus the last event just before the range

I am going to assume that there isn't an index on the date columns, otherwise I think that the query would have been structured differently. If there is, you can probably find a better performing one than this.

The advantage of this query is that it can get all the data in one scan. The disadvantage is that it has to sort the data and join EventEmployee on the entire table. So as always, test with your own situation. This query also assumes that the MAX date is either unique or that equivalent rows would be acceptable.

USE AdventureWorks2012
GO
;
WITH Base AS (
   SELECT 
      TransactionHistory.*
      ,ProductVendor.BusinessEntityID
      ,MAX(CASE WHEN TransactionDate < '2008-08-01' THEN TransactionDate END) 
           OVER (PARTITION BY ProductVendor.BusinessEntityID) AS PreviousVendorTransaction
      ,COUNT(CASE WHEN TransactionDate >= '2008-08-01' THEN 1 END ) 
           OVER (PARTITION BY ProductVendor.BusinessEntityID) AS VendorAfterCutoff
   FROM
      Production.TransactionHistory
      -- Doesn't make the most sense, but I need a repeating relation
      INNER JOIN Purchasing.ProductVendor
         ON TransactionHistory.ProductID = ProductVendor.ProductID
),
Filtered AS (
   SELECT
      *
   FROM
      Base
   WHERE
      Base.TransactionDate >= '2008-08-01'
      OR (TransactionDate = PreviousVendorTransaction AND VendorAfterCutoff > 0)
)
SELECT DISTINCT
   TransactionID
   ,ProductID
   ,ReferenceOrderID
   ,ReferenceOrderLineID
   ,TransactionDate
   ,TransactionType
   ,Quantity
   ,ActualCost
   ,ModifiedDate
FROM
   Filtered

Edit:

Hmm, I think I may have to take back my comment on structuring it differently if there are indexes. The other suggestions that I have are probably fairly minor.

Make sure the query is using the indexes you're expecting it to. Start and End date to build temp table, end date to drive the previous event loop.
If the query to build the temp table is doing a lookup on the clustered index, it may be better to hold off and do that as part of the main query.
Try using a cte instead of a temp table. I think that a cte might be more competitive with the way that the query is structured below.
If you are returning a lot of events, it might be better to pull out the event table lookup to the main query to give the optimizer the option of doing a merge join.
I don't see a way of optimizing the previous event lookup short of an indexed view.

Here's a query that combines a few of those ideas.

SELECT
    e.[EventID]
INTO #EventTemp
FROM
    [Events] AS e
WHERE
    ( e.[EventStart] >= @StartDate AND e.[EventStart] <= @EndDate )
    OR ( e.[EventEnd] >= @StartDate AND e.[EventEnd] <= @EndDate )

;
WITH PrevEvent AS (
    SELECT
        EmpPrevEvent.[EventID]
    FROM
    (
        SELECT DISTINCT
            ee.[EmployeeID]
        FROM
            #EventTemp
            INNER JOIN [EventEmployee] AS ee ON
                #EventTemp.[EventID] = ee.[EventID]
    ) AS Emp
    CROSS APPLY (
        SELECT TOP 1
            e.[EventID]
        FROM
            [Events] AS e
            INNER JOIN [EventEmployee] AS ee ON
                e.[EventID] = ee.[EventID]
        WHERE
            ee.[EmployeeID] = Emp.[EmployeeID]
            AND e.[EventEnd] < @StartDate
        ORDER BY 
            e.[EventEnd] DESC
    ) AS EmpPrevEvent
)
SELECT
    e.[EventID],
    e.[EventStart],
    e.[EventEnd],
    e.[EventTypeID]
FROM
    [Events] AS e
WHERE
    e.EventID IN (
        SELECT EventID
        FROM #EventTemp
        UNION
        SELECT EventID
        FROM PrevEvent
    )

Sql-server – Generate a column from different rows in sql

To address this problem, I did the following (I'm using PostgreSQL for this, but SQL Server has CTEs and RECURSIVE CTEs and window/analytic functions. I doubt if this can be done in MySQL - maybe using variables - not sure). PostgreSQL is as powerful as the big boys - MySQL is really a toy!

Created a table:

CREATE TABLE stock (sku int, datein date, dateout date, status int);

Populated it thus:

INSERT INTO stock VALUES (123, '2015-02-01', '2015-02-05', 1);
INSERT INTO stock VALUES (123, '2015-02-10', '2015-02-10', 1);
INSERT INTO stock VALUES (123, '2015-02-20', NULL,         0);

Then, I constructed the first part of my query. The actual dates to and from will obviously vary depending on the requirements - I assumed that (from the sample data), the OP wanted to look at the month of February. I constructed the following RECURSIVE CTE to construct a list of the days in February - PostgreSQL has the nifty generate_series function, but that's proprietary and I wanted to make my query portable (to the greatest extent possible). This is the best explanation of RCTEs I came across.

WITH RECURSIVE dates (test_date) AS
(
  SELECT '2015-02-01'::DATE
  UNION ALL
  SELECT test_date + 1 FROM dates
  WHERE test_date < '2015-02-28'
)

Now, I thought that the organisation of the raw data was poor, so I used CTEs to rearrange the data.

Get the outdates from the raw data:

out_dates AS
(
  SELECT dateout AS dout FROM stock
)

Get the indates from the raw data:

in_dates AS
(
  SELECT datein AS din FROM stock
)

Then I "transpose" the data - have to use two steps because one cannot use a Window function in a WHERE clause

transpose AS
(
  SELECT lag(dateout) over() AS datein, datein AS dateout FROM stock
  -- WHERE lag(datein) over() IS NOT NULL AND dateout IS NOT NULL <<-- can't use here.
  ORDER BY dateout, datein
)

And then

ranges AS
(
  SELECT * FROM transpose
  WHERE datein IS NOT NULL AND dateout IS NOT NULL
),

The result of this query is:

datein|dateout
2015-02-05|2015-02-10
2015-02-18|2015-02-20

Note that this corresponds to the in and out dates, but now the data is much simpler to handle because it's structure is more logical - it's chronologically coherent - thanks to the use of the LAG window function.

Now, we bring it all together with the next query:

x AS
(
  SELECT test_date, CASE WHEN has_match THEN 1 ELSE 0 END::int AS flag
  FROM 
  (
    SELECT test_date, 
    EXISTS
    (
      SELECT 1 
      FROM ranges 
      WHERE dates.test_date BETWEEN ranges.datein AND ranges.dateout
    ) AS has_match
  FROM dates  
  ) range_checks
)
SELECT * FROM x;

This gives a series of records for the month that are of this form:

test_date|flag
--------------
2015-02-01|0
2015-02-02|0
2015-02-03|0
2015-02-04|0
2015-02-05|1
2015-02-06|1
<other records snipped for brevity>

or use this as the final query

z AS 
(
  SELECT * FROM dates d
  LEFT JOIN ranges r
  ON (d.test_date BETWEEN r.datein AND r.dateout)
)
SELECT * FROM z;

for records of this form:

test_date|datein|dateout
2015-02-01||
2015-02-02||
2015-02-03||
2015-02-04||
2015-02-05|2015-02-05|2015-02-10
2015-02-06|2015-02-05|2015-02-10
<rest of the records snipped for brevity>

Best Answer

Query:

Output:

Output:

Data used with 2 different ids:

Related Solutions

Sql-server – the best way to get all data for a date range, plus the last event just before the range

Sql-server – Generate a column from different rows in sql

Related Question