SQL Server – I've got 60 million sales of 800 different liquid products whose prices change daily

performance, performance-tuning, sql-server, sql-server-2017

…and I'm having a bit of a battle getting to customer stats, like who's bought the most and when their last purchase was.

I've no problem writing the query; it's just really slow.

Sales
CustID, ProdID, Qty, SaleTimestamp
1,1,10,2000-01-01 01:01:01

ProductPrices
ProdID,Price,PriceDate
1,100,2000-01-01

Sales are recorded as a quantity sold at a time/date. All products change price daily, so to find the total sales to all customers we need to join Sales and ProductPrices on product ID and the date of sale:

SELECT
  CustID, SUM(Qty*Price), MAX(s.SaleTimestamp)
FROM
  Sales s
  INNER JOIN
  ProductPrices p
  ON s.ProdID = p.ProdID and CAST(s.SaleTimestamp as date) = p.PriceDate
GROUP BY
  CustID
HAVING SUM(Qty*Price) > 10000 --only big customers

It's a simple enough query, but it's been running for 45 minutes now. I've got 60 million rows in sales, and 800 products * 365 days = 292,000 prices. Sales:ProductPrices is Many:One
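One thing worth ruling out first: the CAST(s.SaleTimestamp as date) in the join predicate stops SQL Server from seeking on an index over SaleTimestamp. A sketch of a persisted computed column plus a covering index that makes the date join sargable (the column and index names here are my own, not from your schema):

```sql
-- Assumption: Sales has no existing date-only column; names are illustrative.
ALTER TABLE Sales
  ADD SaleDate AS CAST(SaleTimestamp AS date) PERSISTED;

-- Covering index so the join and the aggregates can be answered
-- from the index alone, without lookups into the base table.
CREATE INDEX IX_Sales_ProdID_SaleDate
  ON Sales (ProdID, SaleDate)
  INCLUDE (CustID, Qty, SaleTimestamp);
```

The join condition then becomes a plain equality, s.ProdID = p.ProdID AND s.SaleDate = p.PriceDate, with no per-row conversion.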

I've tried another way round to try and pre-reduce the dataset to just customer-product-day-totalqty:

WITH s AS (
 SELECT
  CustID, ProdId, CAST(SaleTimestamp as date) Dat, SUM(Qty) TotQty
 FROM
  Sales
 GROUP BY CustID, ProdId, CAST(SaleTimestamp as date)
) 

SELECT CustId, SUM(s.TotQty * p.Price) Tot
FROM
  ProductPrices p
  INNER JOIN
  s
  ON
    p.PriceDate = s.Dat and p.ProdId = s.ProdId
GROUP BY CustID
HAVING SUM(s.TotQty * p.Price) > 10000 --only big customers

But no luck. I've wrapped the entire thing in a COUNT(*) to test whether it was completing the query quickly but taking ages to stream N million rows back to SSMS – the count is still slow too. I've also swapped the CAST to date for a BETWEEN-style predicate on the non-CTE query (either way round: SaleTimestamp between PriceDate and PriceDate + 1 day, or PriceDate between SaleDate - 1 day and SaleDate), but that appeared to make it worse: the most costly operation became a merge join on the tables, with SQL Server stating that it was many-to-many.
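For the record, the usual sargable form of that range predicate is half-open rather than BETWEEN, so neither side of the join needs converting per row and neither end of the day is double-counted. A sketch of the first query rewritten that way (assuming PriceDate is a date column):

```sql
-- Half-open range join: SaleTimestamp stays untouched, so an index
-- on (ProdID, SaleTimestamp) remains usable for a seek.
SELECT
  s.CustID,
  SUM(s.Qty * p.Price)  AS Tot,
  MAX(s.SaleTimestamp)  AS LastSale
FROM Sales AS s
INNER JOIN ProductPrices AS p
  ON  s.ProdID = p.ProdID
  AND s.SaleTimestamp >= CAST(p.PriceDate AS datetime)
  AND s.SaleTimestamp <  DATEADD(DAY, 1, CAST(p.PriceDate AS datetime))
GROUP BY s.CustID
HAVING SUM(s.Qty * p.Price) > 10000; --only big customers
```

Whether this helps depends on the indexes available; the many-to-many merge join you saw suggests the optimizer had no useful ordering to exploit.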

I'm pretty sure the join is the problem, as I can do this:

WITH s AS (
 SELECT
  CustID, ProdId, CAST(SaleTimestamp as date) Dat, SUM(Qty) TotQty
 FROM
  Sales
 GROUP BY CustID, ProdId, CAST(SaleTimestamp as date)
) 
SELECT COUNT(*) FROM s

And I get an answer of 13 million rows in 6 seconds. If I do the same thing with products:

WITH p AS (
 SELECT
  ProdId, PriceDate, AVG(Price) price
 FROM
  ProductPrices
 GROUP BY ProdId, PriceDate
) 
SELECT COUNT(*) FROM p

I get 292,000 in less than 1 second. If I take these two queries and put them together for my answer:

WITH s AS (
 SELECT
  CustID, ProdId, CAST(SaleTimestamp as date) Dat, SUM(Qty) TotQty
 FROM
  Sales
 GROUP BY CustID, ProdId, CAST(SaleTimestamp as date)
),
p AS (
 SELECT
  ProdId, PriceDate Dat, AVG(Price) price
 FROM
  ProductPrices
 GROUP BY ProdId, PriceDate
) 
SELECT COUNT(*) FROM (
  SELECT s.CustID, SUM(TotQty * Price) Tot
  FROM p INNER JOIN s ON p.ProdId = s.ProdId and p.Dat = s.Dat
  GROUP BY s.CustID
) t

I'm back to 20+ minutes now, with no result…

I realise that it's probably hard to answer without seeing the plan – should I post the XML?

Server hardware is pretty immense: multiple 20-core Xeon Gold CPUs and 2 TB of RAM.

Best Answer

Without seeing the execution plan, it's impossible to say exactly what the problem is. However, the root cause seems to be the way the optimizer is handling the join between the queries in your CTEs (or something along those lines).

In SQL Server, CTEs aren't currently materialized or fenced off (unless you use TOP inside one). In many cases, breaking queries up (whether they use CTEs or not) helps the optimizer.
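One way to break it up, assuming temp tables are an option in your environment: materialize the pre-aggregated sales into a #temp table first, so the optimizer plans the join against real row counts and statistics rather than an estimate for the inlined CTE. Names here are illustrative, not from the question:

```sql
-- Materialize the ~13M-row daily pre-aggregation once.
SELECT
  CustID, ProdID,
  CAST(SaleTimestamp AS date) AS Dat,
  SUM(Qty)            AS TotQty,
  MAX(SaleTimestamp)  AS LastSale
INTO #DailySales
FROM Sales
GROUP BY CustID, ProdID, CAST(SaleTimestamp AS date);

-- Give the join something to seek/merge on.
CREATE CLUSTERED INDEX IX_DailySales ON #DailySales (ProdID, Dat);

SELECT
  s.CustID,
  SUM(s.TotQty * p.Price) AS Tot,
  MAX(s.LastSale)         AS LastSale
FROM #DailySales AS s
INNER JOIN ProductPrices AS p
  ON p.ProdID = s.ProdID AND p.PriceDate = s.Dat
GROUP BY s.CustID
HAVING SUM(s.TotQty * p.Price) > 10000; --only big customers
```

The question's own timings (13M rows in 6 seconds for the aggregation alone) suggest the expensive step is the join planning, not the scan, which is exactly what this split targets.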

As of this answer, there are no hints to alter the way CTEs are handled.