Sql-server – the optimum method of finding the per-item earliest date from a clustered index

greatest-n-per-groupperformancequery-performancesql serversql-server-2012

I have a SQL Server 2012 table which contains columns like this:

ID int NOT NULL,
EventDate datetime NOT NULL,
... 32 other columns...

where the table has about half a billion rows across a range of about 10000 distinct ID values.

The table has a unique clustered index like this:

CREATE UNIQUE CLUSTERED INDEX [MyIndex] ON [dbo].[MyTable] (
    [ID] ASC,
    [EventDate] ASC
)

I need to find the earliest per-ID EventDate, which I can obtain using a query such as:

SELECT ID, min(EventDate) FROM [dbo].[MyTable] GROUP BY ID

However this query takes just under 2 minutes to complete.

I can't share the specifics (query plans etc.) of the problem I'm looking at due to NDA constraints, but I can advise that I'm seeing a clustered index scan, so it's checking all the rows in the table. Given that the data is organised in EventDate sequence I'd expect that the retrieval could be much quicker, but I'm not sure quite how. Any other ID specific range query responds in a few milliseconds and the table has recently been rebuilt and reindexed, so I don't think there are any statistic updates which would help.

Can anyone suggest a better method of determining the minimum per-ID EventDate value which avoids scanning the entire clustered index?

I do have a table with the (10 thousand) distinct id values.

Best Answer

This is related to "Index Skip Scan" optimization (see the Connect item below, from 2011). Unfortunately it has been closed as "Won't Fix".

Some related enhancements are already there but only for partitioned tables: Query Processing Enhancements on Partitioned Tables and Indexes.

Various workarounds exist though:


Workaround / solution 1: CROSS APPLY to subquery with TOP 1

If there is a table with the (10K) distinct ID values, we can use it to do a CROSS APPLY.

-- if you don't have a table already
CREATE TABLE MyTableIDs
  ( ID int NOT NULL,
    PRIMARY KEY (id)
  ) ;

-- we only do this once
INSERT INTO MyTableIDs (ID)
SELECT ID 
FROM MyTable
GROUP BY ID ;

Then the query will do (10K) seeks into the index:

SELECT  i.ID, a.EventDate
FROM    MyTableIDs AS i
    CROSS APPLY
        ( SELECT TOP (1) t.EventDate
          FROM     MyTable AS t
          WHERE    t.ID = i.ID
          ORDER BY EventDate
        ) AS a ;

Workaround / solution 2: Implement "Skip Scan" with Recursive CTE

Another option is a recursive CTE to implement the skip scan. Here is a demo script, posted by @PaulWhite on SSC in October 2010, demonstrating how much faster an rCTE can be: Calculating interest query

In your case, it would be something like this:

WITH    RecursiveCTE
AS      (
        SELECT  a.*
        FROM    ( SELECT TOP (1)  ID, EventDate
                  FROM     MyTable
                  ORDER BY ID, EventDATE
                ) AS a
        UNION   ALL
        SELECT  r.ID, r.EventDate
        FROM    (
                -- A cunning way to use TOP in the recursive part of a CTE ;)
                SELECT  t.ID, t.EventDate,
                        rn = ROW_NUMBER() OVER (ORDER BY t.EventDate)
                FROM    MyTable AS t
                JOIN    RecursiveCTE AS r
                        ON  r.ID < t.ID
                ) AS r
        WHERE   r.rn = 1
        )
SELECT  *
FROM    RecursiveCTE
OPTION  (MAXRECURSION 0) ;

Workaround / solution 3: Partitioned Table

Available if you have Enterprise edition. More details about this possibility in the linked article: Query Processing Enhancements on Partitioned Tables and Indexes

The main disadvantage is that the skip scan will work only if ID is the partitioning column.


Workaround / solution 4: Additional Index

Adding an NCI index on (ID, EventDate) allows an index scan on the much smaller index. See @Daniel Hutmacher's answer for an explanation.

It's still a scan though and not (many) seeks so I'm not sure if it will scale as good as options 1 and 2. Of course everything depends on details (column sizes, number of distinct IDs vs number of repeated values, etc).


Workaround / solution (no, not working) 5: Indexed View

Nice idea, if it worked. Unfortunately a view cannot be indexed if it has GROUP BY and MIN/MAX, see restrictions of Indexed Views.

I wonder why, since it is allowed (and be indexed) if it has GROUP BY and COUNT_BIG(). There may be a Connect item about it, too!


Workaround / solution 6: "Do It Yourself" Indexed View

Implement an equivalent to an indexed view yourself. You could for example, add a MinEventDate column in MyTableIDs (the table with the distinct ID values, see option 1) and add triggers in MyTable that update this column accordingly. Then your query would be a simple SELECT from MyTableIDs.


Links to related web pages