Sql-server – Consolidating multiple related queries, each using different window and/or criteria for counting records

sql serversql server 2014window functions

I'm accumulating results for multiple contingency tables, one for each Test. As an example:

Contingency Table for TestA (one for each test, and for each word)
                EventsWithWord  EventsWithoutWord
TestA passed          n11                 n10         
TestA failed          n01                 n00   

For analysis, I need the number of events matching each of the following categories (this is how R expects the data results):

  • TestA passed for each word: n11
  • TestA results available for each word: n_1 (sum of n11 and n01)
  • TestA passed: n1_ (sum of n11 and n10)
  • TestA results available: n__

Table "TestResults": columns: CompanyID nvarchar(20) NOT NULL, TestName nvarchar(255) NOT NULL, CompanyResults bit NULL

Table "Data" (one row for each event): Columns: EventID int NOT NULL, CompanyID nvarchar(20) NOT NULL

Table "WordEventMap": Columns: EventID int NOT NULL, Word nvarchar(255) NOT NULL

Is there a way of combining some of the queries (e.g., with windowing functions)? Some of these tables are very large–40 million rows currently in WordEventMap and 1.3 million rows in Data, and I expect the results table will have about 4 million rows (40 different tests, 100,000 different words). The current execution plan does not consolidate any of the Common Table Expressions.

Current query:

WITH CTE1
AS (SELECT
  COUNT(DISTINCT Data.EventID) AS n__,
  TestResults.TestName
FROM TestResults
INNER JOIN Data
  ON TestResults.CompanyID = Data.CompanyID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName),
CTE2
AS (SELECT
  COUNT(DISTINCT Data.[EventID]) AS n11,
  TestResults.TestName,
  WordEventMap.Word
FROM TestResults
INNER JOIN Data
  ON TestResults.CompanyID = Data.CompanyID
INNER JOIN WordEventMap
  ON Data.[EventID] = WordEventMap.EventID
WHERE TestResults.CompanyResults = 1
GROUP BY TestResults.TestName,
         WordEventMap.Word),
CTE3
AS (SELECT
  COUNT(DISTINCT Data.[EventID]) AS n1_,
  TestResults.TestName
FROM TestResults
INNER JOIN Data
  ON TestResults.CompanyID = Data.CompanyID
WHERE TestResults.CompanyResults = 1
GROUP BY TestResults.TestName),
CTE4
AS (SELECT
  COUNT(DISTINCT Data.[EventID]) AS n_1,
  TestResults.TestName,
  WordEventMap.Word
FROM TestResults
INNER JOIN Data
  ON TestResults.CompanyID = Data.CompanyID
INNER JOIN WordEventMap
  ON Data.[EventID] = WordEventMap.EventID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName,
         WordEventMap.Word)
SELECT
  CTE2.TestName,
  CTE2.Word,
  n11,
  n1_,
  n_1,
  n__
FROM CTE2
INNER JOIN CTE4
  ON CTE2.Word = CTE4.Word
  AND CTE2.TestName = CTE4.TestName
INNER JOIN CTE1
  ON CTE2.TestName = CTE1.TestName
INNER JOIN CTE3
  ON CTE2.TestName = CTE3.TestName

Here's a SQL fiddle using the above structure and the below data. The above query is saved as the View [Results]. http://sqlfiddle.com/#!6/a8155e/1

Sample data:

Table TestResults
CompanyA TestA 1
CompanyA TestB 0
CompanyB TestA NULL
CompanyB TestB 1
CompanyC TestA 0
CompanyC TestB 1

Table Data
1    CompanyA
2    CompanyA
3    CompanyB
4    CompanyC
5    CompanyB


Table WordEventMap
1 airplane
1 tightrope
1 eggplant
2 eggplant
2 aardvark
2 eggbeater
3 airplane
3 aardvark
3 spaghetti
4 airplane
4 eggplant
4 wikipedia
5 eggplant
5 eggbeater
5 tightrope
5 licorice

Results Set
TestA   aardvark    1   2   1   3
TestB   aardvark    1   3   2   5
TestA   airplane    1   2   2   3
TestB   airplane    2   3   3   5
TestA   eggbeater   1   2   1   3
TestA   eggplant    1   2   2   3
TestB   eggplant    2   3   3   5
TestB   licorice    1   3   1   5
TestB   spaghetti   1   3   1   5
TestA   tightrope   1   2   1   3
TestB   tightrope   1   3   2   5
TestB   wikipedia   1   3   1   5

Best Answer

When you think that optimizer should be able to combine two CTEs into one why don't you write a single CTE, human beings are still smarter than optimizers :-)

As there's only a different WHERE-condition the CTEs 2&4 and 1&3 can be combined using conditional aggregation:

WITH CTE1
AS (SELECT
  COUNT(DISTINCT Data.EventID) AS n__,
  COUNT(DISTINCT CASE WHEN TestResults.CompanyResults = 1 THEN Data.[EventID] END) AS n1_,  
  TestResults.TestName
FROM TestResults
INNER JOIN Data
  ON TestResults.CompanyID = Data.CompanyID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName),
CTE2
AS (SELECT
  COUNT(DISTINCT Data.[EventID]) AS n_1,
  COUNT(DISTINCT CASE WHEN TestResults.CompanyResults = 1 THEN Data.[EventID] END) AS n11,
  TestResults.TestName,
  WordEventMap.Word
FROM TestResults
INNER JOIN Data
  ON TestResults.CompanyID = Data.CompanyID
INNER JOIN WordEventMap
  ON Data.[EventID] = WordEventMap.EventID
WHERE TestResults.CompanyResults IS NOT NULL
GROUP BY TestResults.TestName,
         WordEventMap.Word)
SELECT
  CTE2.TestName,
  CTE2.Word,
  n11,
  n1_,
  n_1,
  n__
FROM CTE2
INNER JOIN CTE1
  ON CTE2.TestName = CTE1.TestName
WHERE n11 > 0  
;

But the main performance problem are those DISTINCTs, based on uniqueness in your actual data you might be able to remove some (for your example data no DISTINCT is needed).