SQL Server – Query Slowness Despite Index

Problem

I need to create a chart of User Retention over time, similar to this:

Ignoring the percentages for a minute, I have a query that shows unique users for a given "cohort" and then the number of users that come back. However, with the volume of data we've acquired over the last few weeks, the query no longer finishes.

Query

;WITH dates AS
(
-- Set up the date range
SELECT convert(date,GETDATE()) as dt, 1 as Id
UNION ALL
SELECT DATEADD(dd,-1,dt),dates.Id - 1
FROM dates
WHERE Id >= -84
)
, cohort as (
-- create the cohorts
SELECT dt AS StartDate, 
    convert(date,CASE WHEN DATEADD(DD, 6, dt) > convert(date,GETDATE()) THEN convert(date,GETDATE()) ELSE DATEADD(DD, 6, dt) END) as EndDate, 
    CONCAT(FORMAT(dt, 'MMM dd'), ' - ', FORMAT(CASE WHEN DATEADD(DD, 6, dt) > GETDATE() THEN GETDATE() ELSE DATEADD(DD, 6, dt) END, 'MMM dd')) as Cohort,
    row_number() over (order by dt) as CohortNo
FROM dates A
WHERE  DATEPART(dw,dt)=1
)
 , cohortevent as (
-- The complete set of cohorts and their events
select c.*, e.*
from cohort c
left join Event e on e.eventtime between c.StartDate and C.EndDate
)
, Retained as(
-- Recursive CTE that works out how long each user has been retained
select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, case when Userid is not null then 1 else 0 end as ret
from cohortevent c
union all
select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, ret+1
from cohortevent c
join Retained on Retained.userid=c.userid and Retained.CohortNo=c.CohortNo-1 and Retained.eventid<c.eventid
)
, WeeksRetained as (
-- Get the highest number of weeks, which is the actual number per user (could probably be combined with previous CTE)
select StartDate, Enddate, Cohort, userID, 
    case when max(ret)=1 then '<1W' else '+'+convert(varchar,max(ret)-1)+'W' end as Weeks
from Retained
group by StartDate, Enddate, Cohort,userid
)
-- Finally pivot this by the number of weeks
select *
from 
(
select StartDate, EndDate, Cohort, Weeks, count(distinct userID) as UserCount
from WeeksRetained
group by StartDate, EndDate, Cohort, Weeks
) src
pivot
(
sum(UserCount)
for Weeks in ([<1W], [+1W], [+2W], [+3W], [+4W], [+5W], [+6W], [+7W], [+8W], [+9W], [+10W], [+11W], [+12W])
) piv
OPTION (MAXRECURSION 0);

Environment

All of the tables are CTEs except for "Event", which has two main columns we care about: UserId and EventTime.

What I've tried

I've added indexes on both UserId and EventTime. I noticed the DTUs (this is an Azure SQL instance) were maxing out originally, but I've vertically scaled the database instance so the DB runs at 70% DTU usage and it's still not completing in 30+ minutes. There are currently only 40k rows in Event.

Execution Plan Link

https://www.brentozar.com/pastetheplan/?id=HkB6xClKH

Best Answer

Testing

Based on your estimated execution plan I tried to get some sample data and get an actual execution of your query. Remember that while I am trying to get closer to your issue, YMMV.

CREATE TABLE dbo.[Event](EventId INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
                    EventTime date,
                    userid INT,
                    Count INT);
CREATE INDEX IX_EventTime_UserID
ON dbo.[Event](EventTime,userid);
CREATE INDEX IX_Event_UserId
ON dbo.[Event](userid)
INCLUDE(EventTime,Count);

INSERT INTO dbo.[Event](EventTime,userid,Count)
SELECT TOP(100000) 
DATEADD(Minute,- ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),GETDATE()),
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) % 2000,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM MASTER..spt_values spt1
CROSS APPLY MASTER..spt_values spt2

This is the actual plan that came out.

While not all operators and plan choices are the same, some parts look like they could match. One of these is the large number of spools and filters with high row counts:

The QueryTimeStats here show about 59 seconds:

 <QueryTimeStats CpuTime="58806" ElapsedTime="58867" />

The generation of the dates and cohort tables is fairly quick when I've broken those out separately, so I don't think those are the culprit.

Performance can differ between running these date tables on their own and joining or filtering on them inside the larger query. It is much easier for the optimizer to calculate estimates for one small part than for everything at once.

Now, if I simply create a temporary table and insert the results of the cohortevent CTE into it to split up the work:

CREATE TABLE #temp(
StartDate date,EndDate date,CoHort varchar(50)
,CohortNo int,EventId int,EventTime date
,Count int,UserID int);

INSERT INTO #temp
SELECT c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID
FROM cohortevent c;

The result is a query plan with fewer spools and better estimates, because the optimizer no longer has to evaluate everything at once. You are giving the optimizer a breather by splitting the query into two parts.

Actual execution plan, with the query time stats:

 <QueryTimeStats CpuTime="4957" ElapsedTime="4961" />
and
 <QueryTimeStats CpuTime="2347" ElapsedTime="2348" />

~7 seconds in total

Adding one more temp table for the cohort CTE gives me the execution time I would want.

Actual execution plan, with the query time stats:

 <QueryTimeStats CpuTime="2" ElapsedTime="2" />

 <QueryTimeStats CpuTime="929" ElapsedTime="475" />

 <QueryTimeStats CpuTime="2304" ElapsedTime="2305" />

~3 seconds in total

With the resulting query:

CREATE TABLE #cohortevent(
StartDate date,EndDate date,CoHort varchar(50)
,CohortNo int,EventId int,EventTime date
,Count int,UserID int);

CREATE TABLE #cohort(
StartDate date,EndDate date,CoHort varchar(50)
,CohortNo int);


;WITH dates AS
(
    -- Set up the date range
    SELECT convert(date,GETDATE()) as dt, 1 as Id
    UNION ALL
    SELECT DATEADD(dd,-1,dt),dates.Id - 1
    FROM dates
    WHERE Id >= -84
)
, cohort as (
    -- create the cohorts
    SELECT dt AS StartDate, 
        convert(date,CASE WHEN DATEADD(DD, 6, dt) > convert(date,GETDATE()) THEN convert(date,GETDATE()) ELSE DATEADD(DD, 6, dt) END) as EndDate, 
        CONCAT(FORMAT(dt, 'MMM dd'), ' - ', FORMAT(CASE WHEN DATEADD(DD, 6, dt) > GETDATE() THEN GETDATE() ELSE DATEADD(DD, 6, dt) END, 'MMM dd')) as Cohort,
        row_number() over (order by dt) as CohortNo
    FROM dates A
    WHERE  DATEPART(dw,dt)=1
)
INSERT INTO #cohort(StartDate,EndDate,CoHort,CohortNo)
SELECT StartDate,EndDate,Cohort,CohortNo
FROM cohort;

;WITH cohortevent as (
    -- The complete set of cohorts and their events
    select c.*, e.*
    from #cohort c
    left join Event e on e.eventtime between c.StartDate and C.EndDate
)
INSERT INTO #cohortevent
(StartDate ,EndDate ,CoHort 
,CohortNo ,EventId ,EventTime 
,Count ,UserID )
SELECT c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID
FROM cohortevent c;

;WITH Retained as(
    -- Recursive CTE that works out how long each user has been retained
    select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, case when Userid is not null then 1 else 0 end as ret
    from #cohortevent c
    union all
    select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, ret+1
    from #cohortevent c
    join Retained on Retained.userid=c.userid and Retained.CohortNo=c.CohortNo-1 and Retained.eventid<c.eventid
)
, WeeksRetained as (
    -- Get the highest number of weeks, which is the actual number per user (could probably be combined with previous CTE)
    select StartDate, Enddate, Cohort, userID, 
        case when max(ret)=1 then '<1W' else '+'+convert(varchar,max(ret)-1)+'W' end as Weeks
    from Retained
    group by StartDate, Enddate, Cohort,userid
)
-- Finally pivot this by the number of weeks
select *
from 
(
  select StartDate, EndDate, Cohort, Weeks, count(distinct userID) as UserCount
  from WeeksRetained
  group by StartDate, EndDate, Cohort, Weeks
) src
pivot
(
  sum(UserCount)
  for Weeks in ([<1W], [+1W], [+2W], [+3W], [+4W], [+5W], [+6W], [+7W], [+8W], [+9W], [+10W], [+11W], [+12W])
) piv
OPTION (MAXRECURSION 0)

DROP TABLE #cohortevent
DROP TABLE #cohort

This will not be the best-performing version of your query, but it should resolve the issue with your spools going crazy. You should also investigate other workarounds, like using a calendar table as @DanGuzman mentioned.
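A calendar table could look roughly like the sketch below. This is only an illustration, not the commenter's actual suggestion: the table name dbo.Calendar, the fixed start date, and the use of sys.all_objects as a row source are all assumptions. The final SELECT returns today and the 86 days before it, which is the same 87-day range the recursive dates CTE generates, so it could stand in for that CTE.

```sql
-- Hypothetical calendar table: one row per day, filled once and reused.
CREATE TABLE dbo.Calendar (dt date NOT NULL PRIMARY KEY);

-- Fill 2019-01-01 through 2028-12-31 (3653 days) without recursion.
-- sys.all_objects is only used as a convenient source of rows (assumption).
INSERT INTO dbo.Calendar (dt)
SELECT TOP (3653)
    DATEADD(DAY,
            CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) - 1,
            CONVERT(date, '20190101'))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

-- Possible replacement for the recursive dates CTE: today back to 86 days ago.
SELECT dt
FROM dbo.Calendar
WHERE dt BETWEEN DATEADD(DAY, -86, CONVERT(date, GETDATE()))
             AND CONVERT(date, GETDATE());
```

Because the rows are stored and indexed, the optimizer has real statistics on the date range instead of having to estimate the output of a recursive CTE.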

Related Solutions

How to Avoid Using Variables in SQL Server WHERE Clause

Parameter sniffing is your friend almost all of the time, and you should write your queries so that it can be used. Parameter sniffing helps build the plan using the parameter values available when the query is compiled. The dark side of parameter sniffing is when the values used when compiling the query are not optimal for the queries to come.

The query in a stored procedure is compiled when the stored procedure is executed, not when the query is executed, so the values that SQL Server has to deal with here...

CREATE PROCEDURE WeeklyProc(@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate)
  SELECT
    -- Stuff
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate
END

is a known value for @endDate and an unknown value for @startDate. That leaves SQL Server guessing 30% of the rows for the filter on @startDate, combined with whatever the statistics tell it for @endDate. On a big table with a lot of rows, that could give you a scan operation where you would benefit most from a seek.

Your wrapper procedure solution makes sure that SQL Server sees the values when DateRangeProc is compiled so it can use known values for both @endDate and @startDate.
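The question's wrapper code is not reproduced here, so the following is only a sketch of what such a pair of procedures might look like. It reuses the Sale table and SaleDate column from the example above; the COUNT(*) column is a placeholder for the real column list.

```sql
-- Inner procedure: both dates arrive as parameters, so both values
-- can be sniffed when this procedure is compiled.
CREATE PROCEDURE DateRangeProc (@startDate DATE, @endDate DATE)
AS
BEGIN
  SELECT COUNT(*) AS SaleCount  -- placeholder for the real column list
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate;
END
GO

-- Wrapper: computes the start date, then hands both values on.
CREATE PROCEDURE WeeklyProc (@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate);
  EXEC DateRangeProc @startDate = @startDate, @endDate = @endDate;
END
GO
```

The local variable still exists in the wrapper, but the query that filters on it now lives in a procedure where it is an ordinary parameter.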

Both of your dynamic queries lead to the same thing: the values are known at compile time.

The one with a default null value is a bit special. The values known to SQL Server at compile time are a known value for @endDate and null for @startDate. Using a null in a BETWEEN will give you 0 rows, but SQL Server always guesses 1 row in those cases. That might be a good thing in this case, but if you call the stored procedure with a large date interval where a scan would have been the best choice, it may end up doing a bunch of seeks.

I left "Use the DATEADD() function directly" to the end of this answer because it is the one I would use and there is something strange with it as well.

First off, SQL Server does not call the function multiple times when it is used in the WHERE clause. DATEADD is considered a runtime constant.

I would have thought that DATEADD is evaluated when the query is compiled, so that you would get a good estimate of the number of rows returned. But it is not so in this case.
SQL Server estimates based on the value in the parameter regardless of what you do with DATEADD (tested on SQL Server 2012), so in your case the estimate will be the number of rows registered on @endDate. Why it does that I don't know, but it has to do with the use of the datatype DATE. Switch to DATETIME in the stored procedure and the table and the estimate will be accurate, meaning that DATEADD is considered at compile time for DATETIME but not for DATE.
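For reference, the "use DATEADD() directly" variant discussed here would presumably look something like this sketch. The procedure name WeeklyProcInline is made up for illustration, and COUNT(*) again stands in for the real column list.

```sql
-- Sketch only: same Sale table and SaleDate column as the procedure above.
CREATE PROCEDURE WeeklyProcInline (@endDate DATE)  -- hypothetical name
AS
BEGIN
  SELECT COUNT(*) AS SaleCount  -- placeholder for the real column list
  FROM Sale
  -- No local variable: the start of the range is computed inline
  -- from the sniffed parameter.
  WHERE SaleDate BETWEEN DATEADD(DAY, -6, @endDate) AND @endDate;
END
```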

So, to summarize this rather lengthy answer, I would recommend the wrapper procedure solution. It will always allow SQL Server to use the values provided when compiling the query, without the hassle of using dynamic SQL.

PS:

In comments you got two suggestions.

OPTION (OPTIMIZE FOR UNKNOWN) will give you an estimate of 9% of rows returned and OPTION (RECOMPILE) will make SQL Server see the parameter values since the query is recompiled every time.
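As a rough sketch of where such a hint goes, here is the original procedure shape with OPTION (RECOMPILE) appended to the statement. The procedure name is hypothetical and COUNT(*) is a placeholder for the real column list.

```sql
CREATE PROCEDURE WeeklyProcRecompile (@endDate DATE)  -- hypothetical name
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate);

  SELECT COUNT(*) AS SaleCount  -- placeholder for the real column list
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate
  -- The statement is recompiled on every execution, so the current
  -- values of @startDate and @endDate are visible to the optimizer.
  -- Swapping in OPTION (OPTIMIZE FOR UNKNOWN) would instead make it
  -- ignore the sniffed values.
  OPTION (RECOMPILE);
END
```

The trade-off is a compilation on every call, which is usually acceptable for a query that runs infrequently.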

SQL Server – The best way to get all data for a date range, plus the last event just before the range

I am going to assume that there isn't an index on the date columns, otherwise I think that the query would have been structured differently. If there is, you can probably find a better performing one than this.

The advantage of this query is that it can get all the data in one scan. The disadvantage is that it has to sort the data and join EventEmployee on the entire table. So as always, test with your own situation. This query also assumes that the MAX date is either unique or that equivalent rows would be acceptable.

USE AdventureWorks2012
GO
;
WITH Base AS (
   SELECT 
      TransactionHistory.*
      ,ProductVendor.BusinessEntityID
      ,MAX(CASE WHEN TransactionDate < '2008-08-01' THEN TransactionDate END) 
           OVER (PARTITION BY ProductVendor.BusinessEntityID) AS PreviousVendorTransaction
      ,COUNT(CASE WHEN TransactionDate >= '2008-08-01' THEN 1 END ) 
           OVER (PARTITION BY ProductVendor.BusinessEntityID) AS VendorAfterCutoff
   FROM
      Production.TransactionHistory
      -- Doesn't make the most sense, but I need a repeating relation
      INNER JOIN Purchasing.ProductVendor
         ON TransactionHistory.ProductID = ProductVendor.ProductID
),
Filtered AS (
   SELECT
      *
   FROM
      Base
   WHERE
      Base.TransactionDate >= '2008-08-01'
      OR (TransactionDate = PreviousVendorTransaction AND VendorAfterCutoff > 0)
)
SELECT DISTINCT
   TransactionID
   ,ProductID
   ,ReferenceOrderID
   ,ReferenceOrderLineID
   ,TransactionDate
   ,TransactionType
   ,Quantity
   ,ActualCost
   ,ModifiedDate
FROM
   Filtered

Edit:

Hmm, I think I may have to take back my comment on structuring it differently if there are indexes. The other suggestions that I have are probably fairly minor.

Make sure the query is using the indexes you're expecting it to. Start and End date to build temp table, end date to drive the previous event loop.
If the query to build the temp table is doing a lookup on the clustered index, it may be better to hold off and do that as part of the main query.
Try using a CTE instead of a temp table. I think that a CTE might be more competitive with the way that the query is structured below.
If you are returning a lot of events, it might be better to pull out the event table lookup to the main query to give the optimizer the option of doing a merge join.
I don't see a way of optimizing the previous event lookup short of an indexed view.
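The indexes these suggestions rely on are not shown in the answer. A minimal sketch, assuming the table and column names used in the query below and made-up index names, might be:

```sql
-- Hypothetical index names; tables and columns are taken from the query below.
-- Range filters on the start and end dates when building the temp table:
CREATE INDEX IX_Events_EventStart ON [Events] ([EventStart]);
CREATE INDEX IX_Events_EventEnd   ON [Events] ([EventEnd]);

-- Per-employee lookup that drives the previous-event search:
CREATE INDEX IX_EventEmployee_EmployeeID
ON [EventEmployee] ([EmployeeID], [EventID]);
```

Whether the optimizer actually picks these up should be confirmed in the actual execution plan.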

Here's a query that combines a few of those ideas.

SELECT
    e.[EventID]
INTO #EventTemp
FROM
    [Events] AS e
WHERE
    ( e.[EventStart] >= @StartDate AND e.[EventStart] <= @EndDate )
    OR ( e.[EventEnd] >= @StartDate AND e.[EventEnd] <= @EndDate )

;
WITH PrevEvent AS (
    SELECT
        EmpPrevEvent.[EventID]
    FROM
    (
        SELECT DISTINCT
            ee.[EmployeeID]
        FROM
            #EventTemp
            INNER JOIN [EventEmployee] AS ee ON
                #EventTemp.[EventID] = ee.[EventID]
    ) AS Emp
    CROSS APPLY (
        SELECT TOP 1
            e.[EventID]
        FROM
            [Events] AS e
            INNER JOIN [EventEmployee] AS ee ON
                e.[EventID] = ee.[EventID]
        WHERE
            ee.[EmployeeID] = Emp.[EmployeeID]
            AND e.[EventEnd] < @StartDate
        ORDER BY 
            e.[EventEnd] DESC
    ) AS EmpPrevEvent
)
SELECT
    e.[EventID],
    e.[EventStart],
    e.[EventEnd],
    e.[EventTypeID]
FROM
    [Events] AS e
WHERE
    e.EventID IN (
        SELECT EventID
        FROM #EventTemp
        UNION
        SELECT EventID
        FROM PrevEvent
    )