Sql-server – Most efficient way to join two tables with MAX value

sql serversql server 2014

I am optimizing a query, where I have to join two tables, let's say Product and TransactionHistory table, and return multiple columns from both table for the last TransactionDate from the TransactionHistory table for each Product in the Product table.

TransactionHistory table has around 13 Million rows and the Product table has almost 2 million rows.

Sounds pretty easy, doesn't it? but for different scenarios different query perform differently.

I am not using my actual table names here, to show the different query syntax I have tried, I am using Adam Machanic's BigProduct and BigSalesHistory Tables.


The query was originally written like….

SELECT p.ProductID
     , p.Name
     , h.Quantity
     , h.TransactionDate
FROM [dbo].[bigProduct] p
INNER JOIN [dbo].[bigTransactionHistory] h ON p.ProductID = h.ProductID
WHERE h.TransactionDate = ( SELECT MAX(TransactionDate)
                            FROM [dbo].[bigTransactionHistory]
                            WHERE ProductID = h.ProductID )
AND p.ProductID > 1317
AND p.ProductID < 1416
GO

Which resulted in some 30 scan counts and almost 300,000 logical reads.


I being a big fan of windowing functions wrote the following queries using ROW_NUMBER(), both backfired badly:

SELECT p.ProductID
     , p.Name
     , c.*
FROM [dbo].[bigProduct] p
INNER JOIN ( SELECT  h.ProductID
                   , h.Quantity
                   , h.TransactionDate
                   , ROW_NUMBER() OVER (PARTITION BY h.ProductID 
                                            ORDER BY h.TransactionDate DESC) rn
              FROM [dbo].[bigTransactionHistory] h
            ) c
            ON p.ProductID = c.ProductID AND rn = 1
WHERE p.ProductID > 1317
  AND p.ProductID < 1416
GO

AND

SELECT p.ProductID
     , p.Name
     , c.*
FROM [dbo].[bigProduct] p
CROSS APPLY ( SELECT  h.ProductID
                   , h.Quantity
                   , h.TransactionDate
                   , ROW_NUMBER() OVER (PARTITION BY h.ProductID 
                                            ORDER BY h.TransactionDate DESC) rn
              FROM [dbo].[bigTransactionHistory] h
              WHERE p.ProductID = ProductID
            ) c
WHERE p.ProductID > 1317
  AND p.ProductID < 1416
  AND rn = 1
GO

The Scan count went into 100s and the logical reads jumped to 700,000s.


Being pretty disappointed by the windowing function's performance I rewrote the query using CROSS APPLY

SELECT p.ProductID
     , p.Name
     , c.*
FROM [dbo].[bigProduct] p
CROSS APPLY ( SELECT TOP 1  h.ProductID
                          , h.Quantity
                          , h.TransactionDate
              FROM [dbo].[bigTransactionHistory] h
              WHERE p.ProductID = h.ProductID
              ORDER BY h.TransactionDate 
             ) c
WHERE p.ProductID > 1317
  AND p.ProductID < 1416
GO

Now this query seems to perform the best out of all the above mentioned but as soon as the OUTER Query's Range for ProductID increases this query goes horribly wrong and starts performing worse than the above two.

Also even though it reduced the logical reads down to 27000 but the CPU time and the elapsed time jumped up by 8 times.


Is there a more efficient way of writing a query like this, keeping in mind that the outer query range for ProductID varies a lot?.

Any suggestion or pointers in the right direction are much appreciated. Thank you.

Best Answer

You need to think about how you would like to solve this if you had a paper-based system.

Like - do you want to use each ProductID one by one and find the newest one? Then your Nested Loop that calls a Seek using TOP 1 on an Index on (ProductID, Transaction Date DESC) will be best.

Do you want to search through all your transactions and group them by ProductID and grab the max? Then that approach could be best.

But I'd suggest to you that what you'd do in real life is to do the second approach on a subset of your data (such as the most recent month), and then do a product-by-product search on the ones that didn't work. Try an outer join onto a subquery that groups by ProductID and filters by TransactionDate to "fairly recent". Then do an Outer Apply to do a Top 1, but include a Predicate in there like where m.ProductID IS NULL, so that it only does those Seeks for products that were missed in the first round. You should see a Filter with a Startup Expression Predicate, and notice that the Index Seek is executed a lot fewer times.

I show this kind of thing at http://blogs.lobsterpot.com.au/2014/07/08/ssis-lookup-transformation-in-t-sql/ - look for the plan I show, on the example with two Outer Apply bits.