SQL Server Performance – Forcing Flow Distinct Optimization

optimizationperformancequery-performancesql serversql server 2014

I have a table like this:

CREATE TABLE Updates
(
    UpdateId INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
    ObjectId INT NOT NULL
)

Essentially tracking updates to objects with an increasing ID.

The consumer of this table will select a chunk of 100 distinct object IDs, ordered by UpdateId and starting from a specific UpdateId. Essentially, keeping track of where it left off and then querying for any updates.

I've found this to be an interesting optimization problem because I've only been able to generate a maximally optimal query plan by writing queries that happen to do what I want due to indexes, but do not guarantee what I want:

SELECT DISTINCT TOP 100 ObjectId
FROM Updates
WHERE UpdateId > @fromUpdateId

Where @fromUpdateId is a stored procedure parameter.

With a plan of:

SELECT <- TOP <- Hash match (flow distinct, 100 rows touched) <- Index seek

Due to the seek on the UpdateId index being used, the results are already nice and ordered from lowest to highest update ID like I want. And this generates a flow distinct plan, which is what I want. But the ordering obviously isn't guaranteed behavior, so I don't want to use it.

This trick also results in the same query plan (though with a redundant TOP):

WITH ids AS
(
    SELECT ObjectId
    FROM Updates
    WHERE UpdateId > @fromUpdateId
    ORDER BY UpdateId OFFSET 0 ROWS
)
SELECT DISTINCT TOP 100 ObjectId FROM ids

Though, I'm not sure (and suspect not) if this truly guarantees ordering.

One query I hoped SQL Server would be smart enough to simplify was this, but it ends up generating a very bad query plan:

SELECT TOP 100 ObjectId
FROM Updates
WHERE UpdateId > @fromUpdateId
GROUP BY ObjectId
ORDER BY MIN(UpdateId)

With a plan of:

SELECT <- Top N Sort <- Hash Match aggregate (50,000+ rows touched) <- Index Seek

I'm trying to find a way to generate an optimal plan with an index seek on UpdateId and a flow distinct to remove duplicate ObjectIds. Any ideas?

Sample data if you want it. Objects will rarely have more than one update, and should almost never have more than one within a set of 100 rows, which is why I'm after a flow distinct, unless there's something better I don't know of? However, there is no guarantee that a single ObjectId won't have more than 100 rows in the table. The table has over 1,000,000 rows and is expected to grow rapidly.

Assume the user of this has another way to find the appropriate next @fromUpdateId. No need to return it in this query.

Best Answer

The SQL Server optimizer cannot produce the execution plan you are after with the guarantee you need, because the Hash Match Flow Distinct operator is not order-preserving.

Though, I'm not sure (and suspect not) if this truly guarantees ordering.

You may observe order preservation in many cases, but this is an implementation detail; there is no guarantee, so you cannot rely on it. As always, presentation order can only be guaranteed by a top-level ORDER BY clause.

Example

The script below shows that Hash Match Flow Distinct does not preserve order. It sets up the table in question with matching numbers 1-50,000 in both columns:

IF OBJECT_ID(N'dbo.Updates', N'U') IS NOT NULL
    DROP TABLE dbo.Updates;
GO
CREATE TABLE Updates
(
    UpdateId INT NOT NULL IDENTITY(1,1),
    ObjectId INT NOT NULL,

    CONSTRAINT PK_Updates_UpdateId PRIMARY KEY (UpdateId)
);
GO
INSERT dbo.Updates (ObjectId)
SELECT TOP (50000)
    ObjectId =
        ROW_NUMBER() OVER (
            ORDER BY C1.[object_id]) 
FROM sys.columns AS C1
CROSS JOIN sys.columns AS C2
ORDER BY
    ObjectId;

The test query is:

DECLARE @Rows bigint = 50000;

-- Optimized for 1 row, but will be 50,000 when executed
SELECT DISTINCT TOP (@Rows)
    U.ObjectId 
FROM dbo.Updates AS U
WHERE 
    U.UpdateId > 0
OPTION (OPTIMIZE FOR (@Rows = 1));

The estimated plan shows an index seek and flow distinct:

Estimated plan

The output certainly seems ordered to start with:

Start of results

...but further down values start to go 'missing':

Pattern breaking down

...and eventually:

Chaos breaks out

The explanation in this particular case, is that the hash operator spills:

Post-execution plan

Once a partition spills, all rows that hash to the same partition also spill. Spilled partitions are processed later, breaking the expectation that distinct values encountered will be emitted immediately in the sequence they are received.


There are many ways to write an efficient query to produce the ordered result you want, such as recursion or using a cursor. However, it cannot be done using Hash Match Flow Distinct.