Sql-server – Optimize Subquery with Windowing Function

performance, query-performance, sql server, sql-server-2012, window functions

As my performance tuning skills never seem to feel sufficient, I always wonder if there is more optimization I can perform against some queries. The situation that this question pertains to is a Windowed MAX function nested within a subquery.

The data that I'm digging through is a series of transactions on various groups of larger sets. I've got 4 fields of importance: the unique ID of a transaction, the Group ID of a batch of transactions, and the dates associated with the unique transaction and with the group of transactions. Most times the Group Date matches the Maximum Unique Transaction Date for a Batch, but there are times when manual adjustments come through our system and a unique date operation occurs after the group transaction date is captured. By design, this manual edit doesn't adjust the group date.

What I identify in this query are those records where the Unique Date falls after the Group Date. The following sample query builds out a rough equivalent of my scenario, and the SELECT statement returns the records I'm looking for. However, am I approaching this solution in the most efficient manner? This takes a while to run during my fact table loads, as my record counts number in the upper 9 digits, but mostly my disdain for subqueries makes me wonder if there's a better approach here. I'm not as concerned about indexes, as I'm confident those are already in place; what I'm looking for is an alternative query approach that achieves the same thing more efficiently. Any feedback is welcome.

CREATE TABLE #Example
(
    UniqueID INT IDENTITY(1,1)
  , GroupID INT
  , GroupDate DATETIME
  , UniqueDate DATETIME
)

CREATE CLUSTERED INDEX [CX_1] ON [#Example]
(
    [UniqueID] ASC
)


SET NOCOUNT ON

--Populate some test data
DECLARE @i INT = 0, @j INT = 5, @UniqueDate DATETIME, @GroupDate DATETIME

WHILE @i < 10000
BEGIN

    IF((@i + @j)%173 = 0)
    BEGIN
        SET @UniqueDate = GETDATE()+@i+5
    END
    ELSE
    BEGIN
        SET @UniqueDate = GETDATE()+@i
    END

    SET @GroupDate = GETDATE()+(@j-1)

    INSERT INTO #Example (GroupID, GroupDate, UniqueDate)
    VALUES (@j, @GroupDate, @UniqueDate)

    SET @i = @i + 1

    IF (@i % 5 = 0)
    BEGIN
        SET @j = @j+5
    END
END
SET NOCOUNT OFF

CREATE NONCLUSTERED INDEX [IX_2_4_3] ON [#Example]
(
    [GroupID] ASC,
    [UniqueDate] ASC,
    [GroupDate] ASC
)
INCLUDE ([UniqueID])

-- Identify any UniqueDates that are greater than the GroupDate within their GroupID
SELECT UniqueID
     , GroupID
     , GroupDate
     , UniqueDate
FROM (
    SELECT UniqueID
         , GroupID
         , GroupDate
         , UniqueDate
         , MAX(UniqueDate) OVER (PARTITION BY GroupID) AS maxUniqueDate
    FROM #Example
    ) calc_maxUD
WHERE maxUniqueDate > GroupDate
    AND maxUniqueDate = UniqueDate

DROP TABLE #Example


Best Answer

I'm assuming there's no index, as you haven't provided any.

Right off the bat, the following index will eliminate a Sort operator in your plan, which would otherwise potentially consume a lot of memory:

CREATE INDEX IX ON #Example (GroupID, UniqueDate) INCLUDE (UniqueID, GroupDate);

The subquery isn't a performance problem in this case. If anything, I would look at ways to eliminate the window function (MAX... OVER) to avoid the Nested Loop and Table Spool construct.

With the same index, the following query may at first glance look less efficient, and it does go from two to three scans on the base table, but it eliminates a huge number of reads internally because it lacks Spool operators. I'm guessing that it'll still perform better, particularly if you have enough CPU cores and IO performance on your server:

SELECT e.UniqueID
     , e.GroupID
     , e.GroupDate
     , e.UniqueDate
FROM (
    SELECT GroupID, MAX(UniqueDate) AS maxUniqueDate
    FROM #Example
    GROUP BY GroupID) AS agg
INNER JOIN #Example AS e ON agg.GroupID=e.GroupID
WHERE agg.maxUniqueDate > e.GroupDate
    AND agg.maxUniqueDate = e.UniqueDate
OPTION (MERGE JOIN);

(Note: I added a MERGE JOIN query hint, but this should probably happen automatically if your statistics are in order. Best practice is to leave hints like these out if you can.)
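One way to check whether the rewrite actually helps on your own data is to run both queries with I/O and timing statistics switched on and compare the messages output. The following is only a sketch: it assumes the #Example table from the question is still populated in the current session (so run it before the DROP TABLE), and it leaves the MERGE JOIN hint out so the optimizer's own choice is what gets measured. Reads caused by spools typically show up as a "Worktable" entry in the STATISTICS IO output.

```sql
-- Sketch: compare logical reads and CPU/elapsed time of the two query shapes.
-- Assumes #Example (from the question) exists and is populated in this session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Original shape: windowed MAX inside a derived table
SELECT UniqueID, GroupID, GroupDate, UniqueDate
FROM (
    SELECT UniqueID, GroupID, GroupDate, UniqueDate
         , MAX(UniqueDate) OVER (PARTITION BY GroupID) AS maxUniqueDate
    FROM #Example
    ) AS calc_maxUD
WHERE maxUniqueDate > GroupDate
    AND maxUniqueDate = UniqueDate;

-- Rewritten shape: aggregate per group, then join back to the base table
SELECT e.UniqueID, e.GroupID, e.GroupDate, e.UniqueDate
FROM (
    SELECT GroupID, MAX(UniqueDate) AS maxUniqueDate
    FROM #Example
    GROUP BY GroupID) AS agg
INNER JOIN #Example AS e ON agg.GroupID = e.GroupID
WHERE agg.maxUniqueDate > e.GroupDate
    AND agg.maxUniqueDate = e.UniqueDate;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The 10,000-row sample is far smaller than a fact table in the upper 9 digits, so treat the numbers as a direction check rather than a prediction, and repeat the comparison against realistic volumes.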

Related Solutions

Sql-server – Page Split Timing

The UPDATE would happen after the split because from a data state perspective, SQL Server will never overwrite another currently-allocated row in the process.

Moreover, if SQL Server did overwrite a portion of another row, and that row had to be moved, it wouldn't know what data to copy to the new page. A copy of the row could be kept in a temporary buffer in memory... which... is the very definition of a data page.

And so the splitting process goes as follows:

1. Allocate a new page
2. Copy the split rows to the new page
3. Deallocate the split rows from the original page
4. Did we reach at least the target amount of free space? If yes, we're done; if no, split again.

Finally, the UPDATE occurs, which is always free to overwrite unallocated portions of the page.

Improve performance with the WHERE NOT IN sub-select clause

With queries like this, it is often more efficient to perform a LEFT OUTER JOIN instead of the NOT EXISTS style check. The join often implies a full index scan (or a table scan without the right indexes in place), but with many rows in the main table(s) this is less expensive than the large number of index seeks (one on the reference table for each row returned from the main table) that would otherwise result. Some query planners are quite bright about spotting this equivalence and using the alternate plan where it is the better choice, but it doesn't sound like this has happened in your case.

Try something like:

SELECT t1.CUSTID, COUNT(*)
FROM   CUST_TRX t1
LEFT OUTER JOIN
       CUST_TRX t2 
ON     t2.CUSTID=t1.CUSTID 
AND    t2.DATED<CURRENT_DATE-365
WHERE  t2.CUSTID IS NULL
GROUP BY t1.CUSTID

(Note: I'm not familiar with Firebird, so the above syntax may need tweaks, but it should illustrate the point.)

Without the WHERE t2.CUSTID IS NULL every row from t1 with matches in t2 will be output once for every match found in t2 and those with no matches in t2 will be output once but with any columns selected from that object set to NULL. The WHERE clause then screens out the matches.
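For comparison, the NOT EXISTS form that this join stands in for would presumably look something like the sketch below. The original question's query is not reproduced here, so its shape is an assumption built from the table and column names used in the join above, and the date arithmetic is copied as-is and may need the same dialect tweaks.

```sql
-- Assumed NOT EXISTS equivalent of the LEFT OUTER JOIN ... IS NULL query above.
-- Keeps each t1 row only when no row for the same customer is older than 365 days.
SELECT t1.CUSTID, COUNT(*)
FROM   CUST_TRX t1
WHERE  NOT EXISTS (
           SELECT 1
           FROM   CUST_TRX t2
           WHERE  t2.CUSTID = t1.CUSTID
           AND    t2.DATED < CURRENT_DATE-365
       )
GROUP BY t1.CUSTID
```

Both forms keep a t1 row exactly when it has no qualifying match in t2, and each kept row appears once, so the per-customer counts should agree. Comparing the plans of the two forms shows whether the planner already treats them the same way.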

Depending on the DB engine's abilities, especially if the amount of data in the reference object (CUST_TRX with a filter applied here) is huge, this may be significantly less efficient than the WHERE <something> NOT IN or WHERE NOT EXISTS options, so benchmark over realistic data sets first before using the method. It often works out much more efficient with MS SQL Server in cases where the query planner doesn't notice that the WHERE NOT IN arrangement can be performed this way more efficiently.

Also, if you do it this way around, leave a comment in the code (and/or supporting documentation) to say that you are doing this as an equivalent to WHERE <something> NOT IN or WHERE NOT EXISTS which you expect to be more efficient. You'll remember it, and an experienced SQL person will recognise the pattern, but other people looking at the code might not immediately understand the intent/reason and flip it back to using WHERE NOT EXISTS for clarity, as that reads better as an English sentence.
