Sql-server – Optimize Subquery with Windowing Function

performancequery-performancesql serversql-server-2012window functions

As my performance tuning skills never seem to feel sufficient, I always wonder if there is more optimization I can perform against some queries. The situation that this question pertains to is a Windowed MAX function nested within a subquery.

The data that I'm digging through is a series of transactions on various groups of larger sets. I've got 4 fields of importance, the unique ID of a transaction, the Group ID of a batch of transactions, and dates associated with the respective unique transaction or group of transactions. Most times the Group Date matches the Maximum Unique Transaction Date for a Batch, but there are times where manual adjustments come through our system and a unique date operation occurs after the group transaction date is captured. This manual edit doesn't adjust the group date by design.

What I identify in this query are those records where the Unique Date falls after the Group Date. The following sample query builds out a rough equivalent of the my scenario and the SELECT statement returns the records I'm looking for, however, am I approaching this solution in the most efficient manner? This takes a while to run during my fact table loads as my record counts number in the upper 9 digits, but mostly my disdain for subqueries makes me wonder if there's a better approach here. I'm not as concerned about any indexes as I'm confident those are already in place; what I'm looking for is an alternative query approach that will achieve the same thing, but even more efficiently. Any feedback is welcome.

CREATE TABLE #Example
(
    UniqueID INT IDENTITY(1,1)
  , GroupID INT
  , GroupDate DATETIME
  , UniqueDate DATETIME
)

CREATE CLUSTERED INDEX [CX_1] ON [#Example]
(
    [UniqueID] ASC
)


SET NOCOUNT ON

--Populate some test data
DECLARE @i INT = 0, @j INT = 5, @UniqueDate DATETIME, @GroupDate DATETIME

WHILE @i < 10000
BEGIN

    IF((@i + @j)%173 = 0)
    BEGIN
        SET @UniqueDate = GETDATE()+@i+5
    END
    ELSE
    BEGIN
        SET @UniqueDate = GETDATE()+@i
    END

    SET @GroupDate = GETDATE()+(@j-1)

    INSERT INTO #Example (GroupID, GroupDate, UniqueDate)
    VALUES (@j, @GroupDate, @UniqueDate)

    SET @i = @i + 1

    IF (@i % 5 = 0)
    BEGIN
        SET @j = @j+5
    END
END
SET NOCOUNT OFF

CREATE NONCLUSTERED INDEX [IX_2_4_3] ON [#Example]
(
    [GroupID] ASC,
    [UniqueDate] ASC,
    [GroupDate] ASC
)
INCLUDE ([UniqueID])

-- Identify any UniqueDates that are greater than the GroupDate within their GroupID
SELECT UniqueID
     , GroupID
     , GroupDate
     , UniqueDate
FROM (
    SELECT UniqueID
         , GroupID
         , GroupDate
         , UniqueDate
         , MAX(UniqueDate) OVER (PARTITION BY GroupID) AS maxUniqueDate
    FROM #Example
    ) calc_maxUD
WHERE maxUniqueDate > GroupDate
    AND maxUniqueDate = UniqueDate

DROP TABLE #Example

dbfiddle here

Best Answer

I'm assuming there's no index, as you haven't provided any.

Right off the bat, the following index will eliminate a Sort operator in your plan, which would otherwise potentially consume a lot of memory:

CREATE INDEX IX ON #Example (GroupID, UniqueDate) INCLUDE (UniqueID, GroupDate);

The subquery isn't a performance problem in this case. If anything, I would look at ways to eliminate the window function (MAX... OVER) to avoid the Nested Loop and Table Spool construct.

With the same index, the following query may at first glance look less efficient, and it does go from two to three scans on the base table, but it eliminates a huge number of reads internally because it lacks Spool operators. I'm guessing that it'll still perform better, particularly if you have enough CPU cores and IO performance on your server:

SELECT e.UniqueID
     , e.GroupID
     , e.GroupDate
     , e.UniqueDate
FROM (
    SELECT GroupID, MAX(UniqueDate) AS maxUniqueDate
    FROM #Example
    GROUP BY GroupID) AS agg
INNER JOIN #Example AS e ON agg.GroupID=e.GroupID
WHERE agg.maxUniqueDate > e.GroupDate
    AND agg.maxUniqueDate = e.UniqueDate
OPTION (MERGE JOIN);

(Note: I added a MERGE JOIN query hint, but this should probably happen automatically if your statistics are in order. Best practice is to leave hints like these out if you can.)