My performance-tuning skills never feel sufficient, so I always wonder whether there is more optimization I can squeeze out of a query. The situation this question pertains to is a windowed MAX function nested within a subquery.
The data I'm digging through is a series of transactions on various groups of larger sets. I have four fields of importance: the unique ID of a transaction, the group ID of a batch of transactions, and the dates associated with the unique transaction and with the group of transactions, respectively. Most of the time the group date matches the maximum unique transaction date for a batch, but sometimes a manual adjustment comes through our system and a unique-date operation occurs after the group transaction date is captured. By design, this manual edit does not adjust the group date.
What I identify in this query are the records where the unique date falls after the group date. The following sample query builds a rough equivalent of my scenario, and the SELECT statement returns the records I'm looking for. However, am I approaching this in the most efficient manner? It takes a while to run during my fact table loads, since my record counts number in the upper nine digits, but mostly my disdain for subqueries makes me wonder whether there's a better approach. I'm not concerned about indexes, as I'm confident those are already in place; what I'm looking for is an alternative query approach that achieves the same result more efficiently. Any feedback is welcome.
CREATE TABLE #Example
(
UniqueID INT IDENTITY(1,1)
, GroupID INT
, GroupDate DATETIME
, UniqueDate DATETIME
)
CREATE CLUSTERED INDEX [CX_1] ON [#Example]
(
[UniqueID] ASC
)
SET NOCOUNT ON
--Populate some test data
DECLARE @i INT = 0, @j INT = 5, @UniqueDate DATETIME, @GroupDate DATETIME
WHILE @i < 10000
BEGIN
    IF ((@i + @j) % 173 = 0)
    BEGIN
        SET @UniqueDate = GETDATE() + @i + 5;
    END
    ELSE
    BEGIN
        SET @UniqueDate = GETDATE() + @i;
    END

    SET @GroupDate = GETDATE() + (@j - 1);

    INSERT INTO #Example (GroupID, GroupDate, UniqueDate)
    VALUES (@j, @GroupDate, @UniqueDate);

    SET @i = @i + 1;

    IF (@i % 5 = 0)
    BEGIN
        SET @j = @j + 5;
    END
END
SET NOCOUNT OFF
CREATE NONCLUSTERED INDEX [IX_2_4_3] ON [#Example]
(
[GroupID] ASC,
[UniqueDate] ASC,
[GroupDate] ASC
)
INCLUDE ([UniqueID])
-- Identify any UniqueDates that are greater than the GroupDate within their GroupID
SELECT UniqueID
, GroupID
, GroupDate
, UniqueDate
FROM (
SELECT UniqueID
, GroupID
, GroupDate
, UniqueDate
, MAX(UniqueDate) OVER (PARTITION BY GroupID) AS maxUniqueDate
FROM #Example
) calc_maxUD
WHERE maxUniqueDate > GroupDate
AND maxUniqueDate = UniqueDate
DROP TABLE #Example
dbfiddle here
Best Answer
If a suitable index isn't already in place, then right off the bat, the following index will eliminate a Sort operator in your plan, which would otherwise potentially consume a lot of memory:
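The index definition itself isn't shown in this excerpt; a sketch of the kind of index meant (the exact name, column order, and INCLUDE list are my assumption) could look like:

```sql
-- Assumed reconstruction, not the answerer's original definition.
-- Leading on GroupID, then UniqueDate, lets the window aggregate
-- (MAX(UniqueDate) OVER (PARTITION BY GroupID)) read rows already
-- ordered by its PARTITION BY column, so the plan needs no Sort.
CREATE NONCLUSTERED INDEX IX_Example_GroupID_UniqueDate ON #Example
(
    GroupID ASC,
    UniqueDate ASC
)
INCLUDE (UniqueID, GroupDate);
```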
The subquery isn't a performance problem in this case. If anything, I would look at ways to eliminate the window function (MAX... OVER) to avoid the Nested Loop and Table Spool construct.
With the same index, the following query may look less efficient at first glance, and it does go from two scans to three on the base table, but it eliminates a huge number of internal reads because it has no Spool operators. I'm guessing it will still perform better, particularly if you have enough CPU cores and I/O capacity on your server:
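The rewritten query isn't shown in this excerpt; a plausible sketch of the approach described (pre-aggregating the per-group MAX with a plain GROUP BY and joining back, with a MERGE JOIN hint) might be:

```sql
-- Hypothetical reconstruction, not the answerer's exact query.
-- The derived table computes each group's maximum UniqueDate once;
-- joining back to the base table replaces the window function's
-- Nested Loop / Table Spool construct with an extra scan plus a join.
SELECT e.UniqueID,
       e.GroupID,
       e.GroupDate,
       e.UniqueDate
FROM #Example AS e
INNER MERGE JOIN (
    SELECT GroupID,
           MAX(UniqueDate) AS maxUniqueDate
    FROM #Example
    GROUP BY GroupID
) AS m
    ON m.GroupID = e.GroupID
WHERE e.UniqueDate = m.maxUniqueDate   -- this row is the group's latest
  AND e.UniqueDate > e.GroupDate;      -- and it falls after the group date
```

Because the filter pins `e.UniqueDate` to the group maximum, comparing `e.UniqueDate` to `e.GroupDate` is equivalent to the original `maxUniqueDate > GroupDate` predicate.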
(Note: I added a MERGE JOIN query hint, but this should happen automatically if your statistics are in order. Best practice is to leave hints like these out if you can.)