Sql-server – Query Slowness Despite Index

sql server

Problem

I need to create a chart of User Retention over time, similar to this:
User Retention Image

Ignoring the percenages for a minute, I have a query that shows unique users for a given "cohort" and then the number of users that come back. However, with the volume of data we've acquired over the last few weeks, the query no longer finishes.

Query

;WITH dates AS
(
-- Set up the date range
SELECT convert(date,GETDATE()) as dt, 1 as Id
UNION ALL
SELECT DATEADD(dd,-1,dt),dates.Id - 1
FROM dates
WHERE Id >= -84
)
, cohort as (
-- create the cohorts
SELECT dt AS StartDate, 
    convert(date,CASE WHEN DATEADD(DD, 6, dt) > convert(date,GETDATE()) THEN convert(date,GETDATE()) ELSE DATEADD(DD, 6, dt) END) as EndDate, 
    CONCAT(FORMAT(dt, 'MMM dd'), ' - ', FORMAT(CASE WHEN DATEADD(DD, 6, dt) > GETDATE() THEN GETDATE() ELSE DATEADD(DD, 6, dt) END, 'MMM dd')) as Cohort,
    row_number() over (order by dt) as CohortNo
FROM dates A
WHERE  DATEPART(dw,dt)=1
)
 , cohortevent as (
-- The complete set of cohorts and their events
select c.*, e.*
from cohort c
left join Event e on e.eventtime between c.StartDate and C.EndDate
)
, Retained as(
-- Recursive CTE that works out how long each user has been retained
select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, case when Userid is not null then 1 else 0 end as ret
from cohortevent c
union all
select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, ret+1
from cohortevent c
join Retained on Retained.userid=c.userid and Retained.CohortNo=c.CohortNo-1 and Retained.eventid<c.eventid
)
, WeeksRetained as (
-- Get the highest number of weeks, which is the actual number per user (could probably be combined with previous CTE)
select StartDate, Enddate, Cohort, userID, 
    case when max(ret)=1 then '<1W' else '+'+convert(varchar,max(ret)-1)+'W' end as Weeks
from Retained
group by StartDate, Enddate, Cohort,userid
)
-- Finally pivot this by the number of weeks
select *
from 
(
select StartDate, EndDate, Cohort, Weeks, count(distinct userID) as UserCount
from WeeksRetained
group by StartDate, EndDate, Cohort, Weeks
) src
pivot
(
sum(UserCount)
for Weeks in ([<1W], [+1W], [+2W], [+3W], [+4W], [+5W], [+6W], [+7W], [+8W], [+9W], [+10W], [+11W], [+12W])
) piv
OPTION (MAXRECURSION 0);

Environment

All of the tables are CTE's except for "Event" which has two main columns we care about, UserId and EventTime.

What I've tried

I've added indexes on both UserId and EventTime. I noticed the DTUs (this is an Azure SQL instance) were maxing out originally, but I've vertically scaled the database instance so the DB runs at 70% DTU usage and it's still not completing in 30+ minutes. There are currently only 40k rows in Event.

Execution Plan Link

https://www.brentozar.com/pastetheplan/?id=HkB6xClKH

Best Answer

Testing

Based on your estimated execution plan I tried to get some sample data and get an actual execution of your query. Remember that while I am trying to get closer to your issue, YMMV.

CREATE TABLE dbo.[Event](EventId INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
                    EventTime date,
                    userid INT,
                    Count INT);
CREATE INDEX IX_EventTime_UserID
ON dbo.[Event](EventTime,userid);
CREATE INDEX IX_Event_UserId
ON dbo.[Event](userid)
INCLUDE(EventTIme,Count)

INSERT INTO dbo.[Event](EventTime,userid,Count)
SELECT TOP(100000) 
DATEADD(Minute,- ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),GETDATE()),
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) % 2000,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM MASTER..spt_values spt1
CROSS APPLY MASTER..spt_values spt2

This is the actual plan that came out.

While not all operators & plan choices are the same, some parts look like they could match. One of these are the high amount of spools & filters with high row counts:

enter image description here

The querytimestats here show as 60 seconds

 <QueryTimeStats CpuTime="58806" ElapsedTime="58867" />

The generation of the dates and cohort tables is fairly quick when I've broken those out separately, so I don't think those are the culprit.

The performance can differ between separate executions and joining , filtering, ... on the execution of these date tables. It is much easier to calculate estimates on one small part than on everything at once.

Now, If I simply create a temporary table, and insert the results of the cohortevent cte into this temporary table to split up the work:

CREATE TABLE #temp(
StartDate date,EndDate date,CoHort varchar(50)
,CohortNo int,EventId int,EventTime date
,Count int,UserID int);

&

INSERT INTO #temp
SELECT c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID
FROM cohortevent c;

The result is a query plan with less spools & better estimates due to not evaluating everything at once. You are kind of giving the optimizer a breather by splitting the query in two parts.

Actual Execution plan With the query time stats:

 <QueryTimeStats CpuTime="4957" ElapsedTime="4961" />
&
 <QueryTimeStats CpuTime="2347" ElapsedTime="2348" />

~ 6 - 7 seconds

Adding one more temp table on the cohort cte gives me the execution time I would desire.

Actual execution plan With querytimestats=

<QueryTimeStats CpuTime="2" ElapsedTime="2" />

&

<QueryTimeStats CpuTime="929" ElapsedTime="475" />

&

    <QueryTimeStats CpuTime="2304" ElapsedTime="2305" />

~3 seconds

With the resulting query:

CREATE TABLE #cohortevent(
StartDate date,EndDate date,CoHort varchar(50)
,CohortNo int,EventId int,EventTime date
,Count int,UserID int);

CREATE TABLE #cohort(
StartDate date,EndDate date,CoHort varchar(50)
,CohortNo int);


;WITH dates AS
(
    -- Set up the date range
    SELECT convert(date,GETDATE()) as dt, 1 as Id
    UNION ALL
    SELECT DATEADD(dd,-1,dt),dates.Id - 1
    FROM dates
    WHERE Id >= -84
)
, cohort as (
    -- create the cohorts
    SELECT dt AS StartDate, 
        convert(date,CASE WHEN DATEADD(DD, 6, dt) > convert(date,GETDATE()) THEN convert(date,GETDATE()) ELSE DATEADD(DD, 6, dt) END) as EndDate, 
        CONCAT(FORMAT(dt, 'MMM dd'), ' - ', FORMAT(CASE WHEN DATEADD(DD, 6, dt) > GETDATE() THEN GETDATE() ELSE DATEADD(DD, 6, dt) END, 'MMM dd')) as Cohort,
        row_number() over (order by dt) as CohortNo
    FROM dates A
    WHERE  DATEPART(dw,dt)=1
)
INSERT INTO #cohort(StartDate,EndDate,CoHort,CohortNo)
SELECT StartDate,EndDate,Cohort,CohortNo
FROM cohort;

;WITH cohortevent as (
    -- The complete set of cohorts and their events
    select c.*, e.*
    from #cohort c
    left join Event e on e.eventtime between c.StartDate and C.EndDate
)
INSERT INTO #cohortevent
(StartDate ,EndDate ,CoHort 
,CohortNo ,EventId ,EventTime 
,Count ,UserID )
SELECT c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID
FROM cohortevent c;

;WITH Retained as(
    -- Recursive CTE that works out how long each user has been retained
    select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, case when Userid is not null then 1 else 0 end as ret
    from #cohortevent c
    union all
    select c.StartDate,c.EndDate,c.CoHort,c.CohortNo,c.EventId,c.EventTime,c.Count,c.UserID, ret+1
    from #cohortevent c
    join Retained on Retained.userid=c.userid and Retained.CohortNo=c.CohortNo-1 and Retained.eventid<c.eventid
)
, WeeksRetained as (
    -- Get the highest number of weeks, which is the actual number per user (could probably be combined with previous CTE)
    select StartDate, Enddate, Cohort, userID, 
        case when max(ret)=1 then '<1W' else '+'+convert(varchar,max(ret)-1)+'W' end as Weeks
    from Retained
    group by StartDate, Enddate, Cohort,userid
)
-- Finally pivot this by the number of weeks
select *
from 
(
  select StartDate, EndDate, Cohort, Weeks, count(distinct userID) as UserCount
  from WeeksRetained
  group by StartDate, EndDate, Cohort, Weeks
) src
pivot
(
  sum(UserCount)
  for Weeks in ([<1W], [+1W], [+2W], [+3W], [+4W], [+5W], [+6W], [+7W], [+8W], [+9W], [+10W], [+11W], [+12W])
) piv
OPTION (MAXRECURSION 0)

DROP TABLE #cohortevent
DROP TABLE #cohort

This will not be the best performing version of your query but this should resolve the issue with your spools going crazy. You should also investigate other workarounds like using a calendar table like @DanGuzman mentioned.