Sql-server – Count Rows that exist before and not after and after but not before

countjoin;sql server

I have an Azure SQL Database, one Table that has 3 Relevant Columns: Brand, ID and Month/Year and a Temporary Table that simply lists the Month/Year with an ID.

The table shows a list of all IDs that were active for that given month.

Month/Year is a datestamp column, with the date as the first of the Month. The ID Column is just an ID.

The goal is to have an output that for each month will show the count of all the records that were active in this month, but not active in the next month (Churn) and IDs that weren't active in the previous month but where active in this month (New) and have these grouped by brand and Month so:

Brand | Churn | New | Month
1 | 10 | 12 | 2019-12-01
2 | 11 | 9 | 2019-12-01

I should also note that Table1 has 19 million rows.

I tried the below code:

Declare @Id int
declare @my date

set @Id = 300
set @my = '2019-12-01'

select t1.[brand],
count(c.[ID]) as [churn],
count(n.[ID]) as [new],
t1.[month/year]
from Table1 T1

left join Table1 c on
c.[ID] = t1.[ID] and c.[ID] not in (select [ID] from Table1 where [Month/Year] = (select [month/year] from Temp_Date where id = (@Id +1)))
left join Table1 n on
n.[ID] = T1.[ID] and n.[ID] not in (select [ID] from Table1 where [Month/Year] = (select [month/year] from Temp_Date where id = (@Id -1)))

where t1.[Month/Year] = @my
group by t1.[Brand], t1.[Month/Year]

Which although it produced an output, I don't think it was right and it took ages, probably due to my liberal use of Sub-Queries in Joins.

My question is 2-fold – can someone assist in a better way to count the number of rows and join them and is there a better way to do this without so many sub-queries?

Best Answer

In general you should include the scripts that generate your Table schemas and populates them with data. (Also for performance questions, you should include the execution plan too.) I couldn't do a lot of testing without real data, but I believe this query should improve the performance you're seeing:

DECLARE @my DATE = '2019-12-01'
DECLARE @nextMonthYear DATE = DATEADD(MONTH, 1, @my)
DECLARE @prevMonthYear DATE = DATEADD(MONTH, -1, @my)

SELECT CurrentMonth.Brand, SUM(CASE WHEN NextMonth.ID IS NULL THEN 1 ELSE 0 END) AS Churn, SUM(CASE WHEN PreviousMonth.ID IS NULL THEN 1 ELSE 0 END) AS New, CurrentMonth.MonthYear
FROM Table1 AS CurrentMonth
LEFT JOIN Table1 AS NextMonth
    ON NextMonth.MonthYear = @nextMonthYear
    AND CurrentMonth.ID = NextMonth.ID  
LEFT JOIN Table1 AS PreviousMonth
    ON PreviousMonth.MonthYear = @prevMonthYear
    AND CurrentMonth.ID = PreviousMonth.ID  
WHERE CurrentMonth.MonthYear = @my
GROUP BY CurrentMonth.Brand, CurrentMonth.MonthYear

This also is a lot simpler (both for the developer to read and for SQL Server to parse) then a bunch of sub-queries. You're likely right that the unneeded additional joining to the sub-queries is a source of your performance issues (but I can't verify that without seeing the execution plan).

What the above query does is use a LEFT JOIN to Table1 on itself twice (once for the PreviousMonth and once for the NextMonth) by ID to get all the records that match and don't match to the CurrentMonth. Then using the SUM() aggregate function with a CASE statement inside of it, we can sum up only the NULLs (cases where the ID of the CurrentMonth did not exist in the PreviousMonth and NextMonth) to get the counts for the New and Churn, grouped by the Brand and MonthYear of the CurrentMonth.

Related Solutions

Sql-server – Selecting Distinct Rows but Counting All Rows

If you want the interestingData and timestamp from the same row (the most recent row that exceeds the threshold), and if you want to include all rows that exceed the threshold even if some rows for that ID don't meet the threshold, then:

;WITH x AS 
(
  SELECT ID, interestingData, [timestamp], 
    [count] = COUNT(1) OVER (PARTITION BY ID),
    rn = ROW_NUMBER() 
      OVER (PARTITION BY ID ORDER BY [timestamp] DESC)
  FROM dbo.tablename
  WHERE interestingData > 300
)
SELECT ID, interestingData, [timestamp], [count]
  FROM x
  WHERE rn = 1;

Also, try to avoid data types and/or reserved keywords as column names. timestamp is not a great choice because (a) it's not very meaningful and (b) it requires square brackets in a lot of scenarios.

Sql-server – Return all rows from table1 and corresponding rows from table2 even if null

You only need to move the WHERE conditions (that involve table2) to the joining, ON condition:

SELECT
   tb1.id,
   tb1.model,
   tb1.location,
   tb2.counter
FROM
   table1 tb1 LEFT JOIN table2 tb2 
   ON (  tb1.id = tb2.printerid
     AND tb2.year = 2017 
     AND tb2.month = 1 
      ) ;

The ON condition can be as complex as required, it doesn't necessarily be a simple equality, although that's the most common (the parentheses around the condition are not required either, just to make the code a bit more clear).

It's unclear how you get Printer, Pages, Year, Month columns in the result when you have id, model, location, counter in the SELECT list.

Best Answer

Related Solutions

Sql-server – Selecting Distinct Rows but Counting All Rows

Sql-server – Return all rows from table1 and corresponding rows from table2 even if null

Related Question