SQL Server – How to Get Seek Based Parallel Plan for DISTINCT/GROUP BY

distinctgroup byindexoptimizationsql server

An example from this question shows that SQL Server will choose a full index scan to solve a query like this:

select distinct [typeName] from [types]

Where [typeName] has a non-clustered, non-unique ascending index on it. His example has 200M rows but only 76 unique values. It seems like a seek plan would be a better choice with that density (~76 multiple binary searchs)?

His case could be normalized but the reason for the question is that I really want to solve something like this:

select TransactionId, max(CreatedUtc) 
from TxLog 
group by TransactionId

There is an index on (TransactionId, MaxCreatedUtc).

Re-writing using a normalized source (dt) does not not change the plan.

select dt.TransactionId, MaxCreatedUtc
 from [Transaction] dt -- distinct transactions
 cross apply
   (
        select Max(CreatedUtc) as MaxCreatedUtc 
          from TxLog tl
         where tl.TransactionId = dt.TransactionId         
   ) ca

Running just the CA subquery as a Scalar UDF does show a plan of 1 seek.

select max(CreatedUtc) as MaxCreatedUtc
 from Pub.TransactionLog 
 where TransactionID = @TxId;

Using that Scalar UDF in the original query seems to work but loses parallelism (known issue with UDFs):

select t.typeName, 
       Pub.ufn_TransactionMaxCreatedUtc(t.TransactionId) as MaxCreatedUtc
  from Pub.[Transaction] t

Plans for Cross apply, Just the UDF, Using UDF

Rewriting using an Inline TVF reverts it back to the scan based plan.

From answer/comment @ypercube:

select TransactionId, MaxCreatedUtc        
 from Pub.[Transaction]  t
  cross apply
   (
        select top (1) CreatedUtc as MaxCreatedUtc 
        from Pub.TransactionLog l
        where l.TransactionID = t.TransactionId
        order by CreatedUtc desc                     
   ) ca

Plan using top/order

Plan looks good. No parallelism but pointless since so fast. Will have to try this on a larger problem sometime. Thanks.

Best Answer

I have exactly the same set up and I've been through the same stages of rewriting the query.

In my case the table names and meaning is a bit different, but overall structure is the same. Your table Transactions corresponds to my table PortalElevators below. It has ~2000 rows. Your table TxLog corresponds to my table PlaybackStats. It has ~150M rows. It has index on (ElevatorID, DataSourceRowID), same as you.

I'll run several variants of the query on the real data and compare execution plans, IO and time stats. I'm using SQL Server 2008 Standard.

GROUP BY with MAX

SELECT [ElevatorID], MAX([DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
GROUP BY [ElevatorID]

group by

group by stats

group by io

Same as for you optimizer scans the index and aggregates results. Slow.

Individual row

Let's see what optimizer would do if I requested MAX just for one row:

SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = 1

individual

Optimizer is smart enough to use the index and it does one seek. By the way, we can see that optimizer uses TOP operator, even though the query doesn't have it. This is a telling sign that optimization paths of MAX and TOP have something in common in the engine, but they are different as we'll see below.

CROSS APPLY with MAX

SELECT
    [dbo].[PortalElevators].elevatorsId
    ,LastItemID
FROM
    [dbo].[PortalElevators]
    CROSS APPLY
    (
        SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
        FROM [dbo].[PlaybackStats]
        WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
    ) AS CA
;

cross apply with max

cross apply with max stats

cross apply with max io

Optimizer still scans the index. It is not smart enough to convert MAX into TOP and scan into seeks here. Slow. I didn't think of this variant originally, my next try was scalar UDF.

Scalar UDF

I saw that plan for getting MAX for individual row had index seek, so I put that simple query in a scalar UDF.

CREATE FUNCTION [dbo].[GetElevatorLastID]
(
    @ParamElevatorID int
)
RETURNS bigint
AS
BEGIN
    DECLARE @Result bigint;
    SELECT @Result = MAX([dbo].[PlaybackStats].[DataSourceRowID])
    FROM [dbo].[PlaybackStats]
    WHERE [dbo].[PlaybackStats].ElevatorID = @ParamElevatorID;
    RETURN @Result;
END

SELECT
    [dbo].[PortalElevators].elevatorsId
    ,[dbo].[GetElevatorLastID]([dbo].[PortalElevators].elevatorsId) AS LastItemID
FROM
    [dbo].[PortalElevators]
;

udf

udf stats

udf io

It does run fast. At least, much faster than Group by. Unfortunately, execution plan doesn't show details of UDF and what is even worse, it doesn't show the real IO stats (it doesn't include IO generated by UDF). You need to run Profiler to see all calls of the function and their stats. This plan shows only 6 reads. The plan for individual row has 4 reads, so real number would be close to: 6 + 2779 * 4 = 6 + 11,116 = 11,122.

CROSS APPLY with TOP

Eventually, I discovered the CROSS APPLY and how it can be applied ;-) in this case.

SELECT
    [dbo].[PortalElevators].elevatorsId
    ,LastItemID
FROM
    [dbo].[PortalElevators]
    CROSS APPLY
    (
        SELECT TOP(1) [dbo].[PlaybackStats].[DataSourceRowID] AS LastItemID
        FROM [dbo].[PlaybackStats]
        WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
        ORDER BY [dbo].[PlaybackStats].[DataSourceRowID] DESC
    ) AS CA
;

cross apply with top

cross apply with top stats

cross apply with top io

Here optimizer is smart enough to do ~2000 seeks. You can see that number of reads is much lower than for group by. Fast.

Interestingly, number of reads here (11,850) is a bit more than reads that I estimated with UDF (11,122). Table IO stats with CROSS APPLY have 11,844 reads and 2,779 scan count of the large table, which gives 11,844 / 2,779 ~= 4.26 reads per index seek. Most likely, seeks for some values use 4 reads and for some 5, with average 4.26. There are 2,779 seeks, but there are values only for 2,130 rows. As I said, it is difficult to get real number of reads with UDF without profiler.

Recursive CTE

As was pointed in comments, Paul White described a Recursive Index Skip Scan method to find distinct values in a large table without performing a full index scan, but doing index seeks recursively. To start recursion we need to find the MIN or MAX value for an anchor and then each step of recursion adds next value one by one. The post explains it in details.

WITH RecursiveCTE
AS
(
    -- Anchor
    SELECT TOP (1) [ElevatorID], [DataSourceRowID]
    FROM [dbo].[PlaybackStats]
    ORDER BY [ElevatorID] DESC, [DataSourceRowID] DESC

    UNION ALL

    -- Recursive
    SELECT R.[ElevatorID], R.[DataSourceRowID]
    FROM
    (
        -- Number the rows
        SELECT
            T.[ElevatorID], T.[DataSourceRowID]
            ,ROW_NUMBER() OVER (ORDER BY T.[ElevatorID] DESC, T.[DataSourceRowID] DESC) AS rn
        FROM
            [dbo].[PlaybackStats] AS T
            INNER JOIN RecursiveCTE AS R ON R.[ElevatorID] > T.[ElevatorID]
    ) AS R
    WHERE
        -- Only the row that sorts lowest
        R.rn = 1
)
SELECT [ElevatorID], [DataSourceRowID]
FROM RecursiveCTE
OPTION (MAXRECURSION 0);

recursive

recursive stats

recursive io

It is pretty fast, though it performs almost twice amount of reads as CROSS APPLY. It does 12,781 reads in Worktable and 8,524 in PlaybackStats. On the other hand, it performs as many seeks as there are distinct values in the large table. CROSS APPLY with TOP performs as many seeks as there are rows in the small table. In my case small table has 2,779 rows, but large table has only 2,130 distinct values.

Summary

                         Logical Reads       Duration
CROSS APPLY with MAX           482,121          6,604
GROUP BY with MAX              482,123          6,581
Scalar UDF                    ~ 11,122            728
Recursive                       21,305             30
CROSS APPLY with TOP            11,850              9 (nine!)

I ran each query three times and picked the best time. There were no physical reads.

Conclusion

In this special case of greatest-n-per-group problem we have:

  • n=1;
  • number of groups is much smaller than the number of rows in a table;
  • there is appropriate index;

Two best methods are:

  1. In case when we have a small table with the list of groups, the best method is CROSS APPLY with TOP.

  2. In case when we have only large table, the best method is Recursive Index Skip Scan.