Sql-server – Count Rows that exist before and not after and after but not before

countjoin;sql server

I have an Azure SQL Database, one Table that has 3 Relevant Columns: Brand, ID and Month/Year and a Temporary Table that simply lists the Month/Year with an ID.

The table shows a list of all IDs that were active for that given month.

Month/Year is a datestamp column, with the date as the first of the Month. The ID Column is just an ID.

The goal is to have an output that for each month will show the count of all the records that were active in this month, but not active in the next month (Churn) and IDs that weren't active in the previous month but where active in this month (New) and have these grouped by brand and Month so:

Brand | Churn | New | Month
1 | 10 | 12 | 2019-12-01
2 | 11 | 9 | 2019-12-01

I should also note that Table1 has 19 million rows.

I tried the below code:

Declare @Id int
declare @my date

set @Id = 300
set @my = '2019-12-01'

select t1.[brand],
count(c.[ID]) as [churn],
count(n.[ID]) as [new],
t1.[month/year]
from Table1 T1

left join Table1 c on
c.[ID] = t1.[ID] and c.[ID] not in (select [ID] from Table1 where [Month/Year] = (select [month/year] from Temp_Date where id = (@Id +1)))
left join Table1 n on
n.[ID] = T1.[ID] and n.[ID] not in (select [ID] from Table1 where [Month/Year] = (select [month/year] from Temp_Date where id = (@Id -1)))

where t1.[Month/Year] = @my
group by t1.[Brand], t1.[Month/Year]

Which although it produced an output, I don't think it was right and it took ages, probably due to my liberal use of Sub-Queries in Joins.

My question is 2-fold – can someone assist in a better way to count the number of rows and join them and is there a better way to do this without so many sub-queries?

Best Answer

In general you should include the scripts that generate your Table schemas and populates them with data. (Also for performance questions, you should include the execution plan too.) I couldn't do a lot of testing without real data, but I believe this query should improve the performance you're seeing:

DECLARE @my DATE = '2019-12-01'
DECLARE @nextMonthYear DATE = DATEADD(MONTH, 1, @my)
DECLARE @prevMonthYear DATE = DATEADD(MONTH, -1, @my)

SELECT CurrentMonth.Brand, SUM(CASE WHEN NextMonth.ID IS NULL THEN 1 ELSE 0 END) AS Churn, SUM(CASE WHEN PreviousMonth.ID IS NULL THEN 1 ELSE 0 END) AS New, CurrentMonth.MonthYear
FROM Table1 AS CurrentMonth
LEFT JOIN Table1 AS NextMonth
    ON NextMonth.MonthYear = @nextMonthYear
    AND CurrentMonth.ID = NextMonth.ID  
LEFT JOIN Table1 AS PreviousMonth
    ON PreviousMonth.MonthYear = @prevMonthYear
    AND CurrentMonth.ID = PreviousMonth.ID  
WHERE CurrentMonth.MonthYear = @my
GROUP BY CurrentMonth.Brand, CurrentMonth.MonthYear

This also is a lot simpler (both for the developer to read and for SQL Server to parse) then a bunch of sub-queries. You're likely right that the unneeded additional joining to the sub-queries is a source of your performance issues (but I can't verify that without seeing the execution plan).

What the above query does is use a LEFT JOIN to Table1 on itself twice (once for the PreviousMonth and once for the NextMonth) by ID to get all the records that match and don't match to the CurrentMonth. Then using the SUM() aggregate function with a CASE statement inside of it, we can sum up only the NULLs (cases where the ID of the CurrentMonth did not exist in the PreviousMonth and NextMonth) to get the counts for the New and Churn, grouped by the Brand and MonthYear of the CurrentMonth.