Sql-server – the optimum method of finding the per-item earliest date from a clustered index

greatest-n-per-groupperformancequery-performancesql serversql-server-2012

I have a SQL Server 2012 table which contains columns like this:

ID int NOT NULL,
EventDate datetime NOT NULL,
... 32 other columns...

where the table has about half a billion rows across a range of about 10000 distinct ID values.

The table has a unique clustered index like this:

CREATE UNIQUE CLUSTERED INDEX [MyIndex] ON [dbo].[MyTable] (
    [ID] ASC,
    [EventDate] ASC
)

I need to find the earliest per-ID EventDate, which I can obtain using a query such as:

SELECT ID, min(EventDate) FROM [dbo].[MyTable] GROUP BY ID

However this query takes just under 2 minutes to complete.

I can't share the specifics (query plans etc.) of the problem I'm looking at due to NDA constraints, but I can advise that I'm seeing a clustered index scan, so it's checking all the rows in the table. Given that the data is organised in EventDate sequence I'd expect that the retrieval could be much quicker, but I'm not sure quite how. Any other ID specific range query responds in a few milliseconds and the table has recently been rebuilt and reindexed, so I don't think there are any statistic updates which would help.

Can anyone suggest a better method of determining the minimum per-ID EventDate value which avoids scanning the entire clustered index?

I do have a table with the (10 thousand) distinct id values.

Best Answer

This is related to "Index Skip Scan" optimization (see the Connect item below, from 2011). Unfortunately it has been closed as "Won't Fix".

Some related enhancements are already there but only for partitioned tables: Query Processing Enhancements on Partitioned Tables and Indexes.

Various workarounds exist though:

Workaround / solution 1: CROSS APPLY to subquery with TOP 1

If there is a table with the (10K) distinct ID values, we can use it to do a CROSS APPLY.

-- if you don't have a table already
CREATE TABLE MyTableIDs
  ( ID int NOT NULL,
    PRIMARY KEY (id)
  ) ;

-- we only do this once
INSERT INTO MyTableIDs (ID)
SELECT ID 
FROM MyTable
GROUP BY ID ;

Then the query will do (10K) seeks into the index:

SELECT  i.ID, a.EventDate
FROM    MyTableIDs AS i
    CROSS APPLY
        ( SELECT TOP (1) t.EventDate
          FROM     MyTable AS t
          WHERE    t.ID = i.ID
          ORDER BY EventDate
        ) AS a ;

Workaround / solution 2: Implement "Skip Scan" with Recursive CTE

Another option is a recursive CTE to implement the skip scan. Here is a demo script, posted by @PaulWhite on SSC in October 2010, demonstrating how much faster an rCTE can be: Calculating interest query

In your case, it would be something like this:

WITH    RecursiveCTE
AS      (
        SELECT  a.*
        FROM    ( SELECT TOP (1)  ID, EventDate
                  FROM     MyTable
                  ORDER BY ID, EventDATE
                ) AS a
        UNION   ALL
        SELECT  r.ID, r.EventDate
        FROM    (
                -- A cunning way to use TOP in the recursive part of a CTE ;)
                SELECT  t.ID, t.EventDate,
                        rn = ROW_NUMBER() OVER (ORDER BY t.EventDate)
                FROM    MyTable AS t
                JOIN    RecursiveCTE AS r
                        ON  r.ID < t.ID
                ) AS r
        WHERE   r.rn = 1
        )
SELECT  *
FROM    RecursiveCTE
OPTION  (MAXRECURSION 0) ;

Workaround / solution 3: Partitioned Table

Available if you have Enterprise edition. More details about this possibility in the linked article: Query Processing Enhancements on Partitioned Tables and Indexes

The main disadvantage is that the skip scan will work only if ID is the partitioning column.

Workaround / solution 4: Additional Index

Adding an NCI index on (ID, EventDate) allows an index scan on the much smaller index. See @Daniel Hutmacher's answer for an explanation.

It's still a scan though and not (many) seeks so I'm not sure if it will scale as good as options 1 and 2. Of course everything depends on details (column sizes, number of distinct IDs vs number of repeated values, etc).

~~Workaround / solution~~ (no, not working) 5: Indexed View

Nice idea, if it worked. Unfortunately a view cannot be indexed if it has GROUP BY and MIN/MAX, see restrictions of Indexed Views.

I wonder why, since it is allowed (and be indexed) if it has GROUP BY and COUNT_BIG(). There may be a Connect item about it, too!

Workaround / solution 6: "Do It Yourself" Indexed View

Implement an equivalent to an indexed view yourself. You could for example, add a MinEventDate column in MyTableIDs (the table with the distinct ID values, see option 1) and add triggers in MyTable that update this column accordingly. Then your query would be a simple SELECT from MyTableIDs.

Links to related web pages

Connect item: Implement Index Skip Scan
Implemented feature: Query Processing Enhancements on Partitioned Tables and Indexes
Indexed Views and restrictions: Create Indexed Views

Related Solutions

Sql-server – Finding exact row in Clustered index leaf page

The specifics of how SQL Server does it is not public.

But I recommend you read the vast research material on the subject which is public. Modern B-Tree Techniques is a good start. Many more papers from Goetz Graefe talk about this area, he is well known in the industry.

Interpolation search is a well known technique in B-Trees and there is plenty of material detailing possible approaches (bear in mind that none of the papers describe explicitly how SQL Server does it and there are many ways to skin a cat).

Sql-server – SHOWPLAN does not display a warning but “Include Execution Plan” does for the same query

This:

SET SHOWPLAN_XML ON;
GO
SELECT * FROM sys.objects;
GO

Is equivalent to pressing Display Estimated Execution Plan on the toolbar (or hitting Ctrl + L). You'll notice that no rows are returned from the query, like there is when you use Include Actual Execution Plan (Ctrl + M).

The spill warning is only a runtime warning. There is no way that SQL Server can know, when displaying the estimated plan, that a spill will happen at runtime. This is because a spill is caused by factors that might only be present during certain invocations of the query (for example, when there is memory pressure). The estimated plan knows roughly how much memory it's going to ask for, but it can't know until execution that it isn't going to get it.

As an aside, may I recommend* our free tool, SQL Sentry Plan Explorer? I think it provides much more obvious information than Management Studio. I recently wrote a lengthy blog post that can act as a tutorial, and Jonathan Kehayias has a great PluralSight course on it as well.

_{* Disclaimer: I work for SQL Sentry.}

Best Answer

Related Solutions

Sql-server – Finding exact row in Clustered index leaf page

Sql-server – SHOWPLAN does not display a warning but “Include Execution Plan” does for the same query

Related Question