I have a table like this:
CREATE TABLE Updates
(
UpdateId INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
ObjectId INT NOT NULL
)
Essentially tracking updates to objects with an increasing ID.
The consumer of this table will select a chunk of 100 distinct object IDs, ordered by UpdateId and starting from a specific UpdateId. Essentially, it keeps track of where it left off and then queries for any updates.
I've found this to be an interesting optimization problem, because the only way I've been able to generate an optimal query plan is by writing queries that happen to produce the result I want thanks to the indexes, but don't actually guarantee it:
SELECT DISTINCT TOP 100 ObjectId
FROM Updates
WHERE UpdateId > @fromUpdateId
Where @fromUpdateId is a stored procedure parameter.
With a plan of:
SELECT <- TOP <- Hash match (flow distinct, 100 rows touched) <- Index seek
Because the seek on the UpdateId index is used, the results come back nicely ordered from lowest to highest update ID, as I want, and the plan uses a flow distinct, which is also what I want. But that ordering obviously isn't guaranteed behavior, so I don't want to rely on it.
This trick also results in the same query plan (though with a redundant TOP):
WITH ids AS
(
SELECT ObjectId
FROM Updates
WHERE UpdateId > @fromUpdateId
ORDER BY UpdateId OFFSET 0 ROWS
)
SELECT DISTINCT TOP 100 ObjectId FROM ids
Though I'm not sure this truly guarantees ordering (and I suspect it doesn't).
One query I hoped SQL Server would be smart enough to simplify was this, but it ends up generating a very bad query plan:
SELECT TOP 100 ObjectId
FROM Updates
WHERE UpdateId > @fromUpdateId
GROUP BY ObjectId
ORDER BY MIN(UpdateId)
With a plan of:
SELECT <- Top N Sort <- Hash Match aggregate (50,000+ rows touched) <- Index Seek
I'm trying to find a way to generate an optimal plan with an index seek on UpdateId and a flow distinct to remove duplicate ObjectIds. Any ideas?
Sample data if you want it. Objects will rarely have more than one update, and should almost never have more than one within a set of 100 rows, which is why I'm after a flow distinct, unless there's something better I don't know of. However, there is no guarantee that a single ObjectId won't have more than 100 rows in the table. The table has over 1,000,000 rows and is expected to grow rapidly.
Assume the consumer of this has another way to find the appropriate next @fromUpdateId; there's no need to return it from this query.
Best Answer
The SQL Server optimizer cannot produce the execution plan you are after with the guarantee you need, because the Hash Match Flow Distinct operator is not order-preserving.
You may observe order preservation in many cases, but this is an implementation detail; there is no guarantee, so you cannot rely on it. As always, presentation order can only be guaranteed by a top-level ORDER BY clause.
Example
The script below shows that Hash Match Flow Distinct does not preserve order. It sets up the table in question with matching numbers 1-50,000 in both columns:
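The original script isn't reproduced here; a minimal reconstruction under the same assumptions (table and column names taken from the question, with both columns carrying matching values 1-50,000) might look like:

```sql
-- Hedged reconstruction of the setup, not necessarily the answer's exact script.
CREATE TABLE dbo.Updates
(
    UpdateId integer IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ObjectId integer NOT NULL
);

-- Load 50,000 rows so UpdateId and ObjectId both run 1-50,000
WITH Nums AS
(
    SELECT TOP (50000)
        n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns AS C1
    CROSS JOIN sys.all_columns AS C2
)
INSERT dbo.Updates (ObjectId)
SELECT N.n
FROM Nums AS N
ORDER BY N.n; -- ORDER BY on INSERT...SELECT makes identity assignment follow n
```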
The test query is:
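The exact test query isn't shown here; a query of the shape used in the question (the TOP value is an assumption, chosen large enough to make the hash operator spill) would be:

```sql
SELECT DISTINCT TOP (50000)
    U.ObjectId
FROM dbo.Updates AS U
WHERE U.UpdateId > 0;
```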
The estimated plan shows an index seek and flow distinct:
The output certainly seems ordered to start with:
...but further down values start to go 'missing':
...and eventually:
The explanation in this particular case is that the hash operator spills:
Once a partition spills, all rows that hash to the same partition also spill. Spilled partitions are processed later, breaking the expectation that distinct values encountered will be emitted immediately in the sequence they are received.
There are many ways to write an efficient query that produces the ordered result you want, for example using recursion or a cursor. However, it cannot be done using Hash Match Flow Distinct.
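As a sketch of the cursor option (the names and the chunk size of 100 are carried over from the question; this is not the answer's exact code): read rows in guaranteed UpdateId order, keep the first occurrence of each ObjectId, and stop after 100 distinct values.

```sql
DECLARE @fromUpdateId integer = 0; -- assumed starting point for illustration

DECLARE @Results table
(
    seq integer IDENTITY(1,1) PRIMARY KEY, -- preserves first-seen order
    ObjectId integer NOT NULL UNIQUE
);

DECLARE @ObjectId integer;

DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT U.ObjectId
    FROM dbo.Updates AS U
    WHERE U.UpdateId > @fromUpdateId
    ORDER BY U.UpdateId; -- the order guarantee comes from this ORDER BY

OPEN c;

WHILE 1 = 1
BEGIN
    FETCH NEXT FROM c INTO @ObjectId;
    IF @@FETCH_STATUS <> 0 BREAK;

    -- Keep only the first occurrence of each ObjectId
    IF NOT EXISTS (SELECT 1 FROM @Results AS R WHERE R.ObjectId = @ObjectId)
        INSERT @Results (ObjectId) VALUES (@ObjectId);

    -- Stop once 100 distinct values have been collected
    IF (SELECT COUNT(*) FROM @Results) >= 100 BREAK;
END;

CLOSE c;
DEALLOCATE c;

SELECT R.ObjectId
FROM @Results AS R
ORDER BY R.seq; -- top-level ORDER BY guarantees presentation order
```

A recursive approach can avoid the row-by-row cursor overhead, but the cursor version is the most direct way to express "first 100 distinct values in UpdateId order" with a guaranteed result.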