I am optimizing a query where I have to join two tables, say Product and TransactionHistory, and return multiple columns from both tables for the last TransactionDate from the TransactionHistory table for each product in the Product table. The TransactionHistory table has around 13 million rows and the Product table has almost 2 million rows.

Sounds pretty easy, doesn't it? But different queries perform differently in different scenarios. I am not using my actual table names here; to show the different query syntaxes I have tried, I am using Adam Machanic's bigProduct and bigTransactionHistory tables.

The query was originally written like this:
SELECT p.ProductID
, p.Name
, h.Quantity
, h.TransactionDate
FROM [dbo].[bigProduct] p
INNER JOIN [dbo].[bigTransactionHistory] h ON p.ProductID = h.ProductID
WHERE h.TransactionDate = ( SELECT MAX(TransactionDate)
FROM [dbo].[bigTransactionHistory]
WHERE ProductID = h.ProductID )
AND p.ProductID > 1317
AND p.ProductID < 1416
GO
This resulted in some 30 scan counts and almost 300,000 logical reads. Being a big fan of windowing functions, I wrote the following two queries using ROW_NUMBER(); both backfired badly:
SELECT p.ProductID
, p.Name
, c.*
FROM [dbo].[bigProduct] p
INNER JOIN ( SELECT h.ProductID
, h.Quantity
, h.TransactionDate
, ROW_NUMBER() OVER (PARTITION BY h.ProductID
ORDER BY h.TransactionDate DESC) rn
FROM [dbo].[bigTransactionHistory] h
) c
ON p.ProductID = c.ProductID AND rn = 1
WHERE p.ProductID > 1317
AND p.ProductID < 1416
GO
and:
SELECT p.ProductID
, p.Name
, c.*
FROM [dbo].[bigProduct] p
CROSS APPLY ( SELECT h.ProductID
, h.Quantity
, h.TransactionDate
, ROW_NUMBER() OVER (PARTITION BY h.ProductID
ORDER BY h.TransactionDate DESC) rn
FROM [dbo].[bigTransactionHistory] h
WHERE p.ProductID = ProductID
) c
WHERE p.ProductID > 1317
AND p.ProductID < 1416
AND rn = 1
GO
The scan count went into the hundreds and the logical reads jumped to around 700,000. Disappointed by the windowing functions' performance, I rewrote the query using CROSS APPLY with TOP 1:
SELECT p.ProductID
, p.Name
, c.*
FROM [dbo].[bigProduct] p
CROSS APPLY ( SELECT TOP 1 h.ProductID
, h.Quantity
, h.TransactionDate
FROM [dbo].[bigTransactionHistory] h
WHERE p.ProductID = h.ProductID
ORDER BY h.TransactionDate DESC
) c
WHERE p.ProductID > 1317
AND p.ProductID < 1416
GO
This query seems to perform the best of the three, but as soon as the outer query's range of ProductID values increases, it goes horribly wrong and starts performing worse than the other two. Also, even though it reduced the logical reads to around 27,000, the CPU time and the elapsed time jumped up by a factor of eight.

Is there a more efficient way of writing a query like this, keeping in mind that the outer query's range for ProductID varies a lot? Any suggestions or pointers in the right direction are much appreciated. Thank you.
Best Answer
You need to think about how you would solve this if you had a paper-based system.
Would you take each ProductID one by one and find its newest transaction? Then a Nested Loop that performs a TOP 1 seek against an index on (ProductID, TransactionDate DESC) will be best.
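For that seek-per-product plan, an index like the following would be needed; this is only a sketch, and the index name and INCLUDE list are assumptions to adjust for your schema:

```sql
-- Hypothetical covering index to support the per-product TOP 1 seek.
-- INCLUDE (Quantity) avoids key lookups for the selected column.
CREATE NONCLUSTERED INDEX IX_bigTransactionHistory_Product_Date
    ON [dbo].[bigTransactionHistory] (ProductID, TransactionDate DESC)
    INCLUDE (Quantity);
GO
```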
Or would you search through all your transactions, group them by ProductID, and grab the max? Then that approach could be best.
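One way the grouped approach could be written is as a join back onto a GROUP BY derived table; this is a sketch, not a tested solution:

```sql
SELECT p.ProductID
     , p.Name
     , h.Quantity
     , h.TransactionDate
FROM [dbo].[bigProduct] p
INNER JOIN ( SELECT ProductID
                  , MAX(TransactionDate) AS MaxTransactionDate
             FROM [dbo].[bigTransactionHistory]
             GROUP BY ProductID
           ) m ON m.ProductID = p.ProductID
-- Join back to retrieve the remaining columns for the max date.
INNER JOIN [dbo].[bigTransactionHistory] h
        ON h.ProductID = m.ProductID
       AND h.TransactionDate = m.MaxTransactionDate
WHERE p.ProductID > 1317
  AND p.ProductID < 1416
GO
```

Note that the join back can return more than one row per product if two transactions share the same maximum date.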
But I'd suggest to you that what you'd do in real life is to do the second approach on a subset of your data (such as the most recent month), and then do a product-by-product search on the ones that didn't work. Try an outer join onto a subquery that groups by ProductID and filters by TransactionDate to "fairly recent". Then do an Outer Apply to do a Top 1, but include a Predicate in there like
where m.ProductID IS NULL
, so that it only does those Seeks for products that were missed in the first round. You should see a Filter with a Startup Expression Predicate, and notice that the Index Seek is executed a lot fewer times.I show this kind of thing at http://blogs.lobsterpot.com.au/2014/07/08/ssis-lookup-transformation-in-t-sql/ - look for the plan I show, on the example with two Outer Apply bits.
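Put together, the hybrid might look something like this. It is a sketch under assumptions: the @Recent cutoff is invented, and I use ROW_NUMBER() over the filtered subset (rather than MAX alone) so the first round can also return Quantity:

```sql
DECLARE @Recent datetime = DATEADD(MONTH, -1, GETDATE()); -- assumed cutoff

SELECT p.ProductID
     , p.Name
     , COALESCE(m.TransactionDate, o.TransactionDate) AS TransactionDate
     , COALESCE(m.Quantity, o.Quantity) AS Quantity
FROM [dbo].[bigProduct] p
-- First round: consider only the "fairly recent" transactions.
LEFT JOIN ( SELECT h.ProductID
                 , h.Quantity
                 , h.TransactionDate
                 , ROW_NUMBER() OVER (PARTITION BY h.ProductID
                                      ORDER BY h.TransactionDate DESC) AS rn
            FROM [dbo].[bigTransactionHistory] h
            WHERE h.TransactionDate >= @Recent
          ) m ON m.ProductID = p.ProductID AND m.rn = 1
-- Second round: TOP 1 seek, but only for products missed above.
OUTER APPLY ( SELECT TOP (1) h.Quantity
                   , h.TransactionDate
              FROM [dbo].[bigTransactionHistory] h
              WHERE h.ProductID = p.ProductID
                AND m.ProductID IS NULL  -- startup filter predicate
              ORDER BY h.TransactionDate DESC
            ) o
WHERE p.ProductID > 1317
  AND p.ProductID < 1416
GO
```

The m.ProductID IS NULL condition inside the APPLY is what produces the Filter with a Startup Expression Predicate, so the seek only runs for products with no recent transaction.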