I've run into a few SELECT DISTINCT TOP N queries which appear to be poorly optimized by the SQL Server query optimizer. Let's start by considering a trivial example: a million row table with two alternating values. I'll use the GetNums function to generate the data:
DROP TABLE IF EXISTS X_2_DISTINCT_VALUES;
CREATE TABLE X_2_DISTINCT_VALUES (PK INT IDENTITY (1, 1), VAL INT NOT NULL);
INSERT INTO X_2_DISTINCT_VALUES WITH (TABLOCK) (VAL)
SELECT N % 2
FROM dbo.GetNums(1000000);
UPDATE STATISTICS X_2_DISTINCT_VALUES WITH FULLSCAN;
For the following query:
SELECT DISTINCT TOP 2 VAL
FROM X_2_DISTINCT_VALUES
OPTION (MAXDOP 1);
SQL Server can find two distinct values just by scanning the first data page of the table but it scans all of the data instead. Why doesn't SQL Server just scan until it finds the requested number of distinct values?
For this question please use the following test data that contains 10 million rows with 10 distinct values generated in blocks:
DROP TABLE IF EXISTS X_10_DISTINCT_HEAP;
CREATE TABLE X_10_DISTINCT_HEAP (VAL VARCHAR(10) NOT NULL);
INSERT INTO X_10_DISTINCT_HEAP WITH (TABLOCK)
SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
FROM dbo.GetNums(10000000);
UPDATE STATISTICS X_10_DISTINCT_HEAP WITH FULLSCAN;
Answers for a table with a clustered index are also acceptable:
DROP TABLE IF EXISTS X_10_DISTINCT_CI;
CREATE TABLE X_10_DISTINCT_CI (PK INT IDENTITY (1, 1), VAL VARCHAR(10) NOT NULL, PRIMARY KEY (PK));
INSERT INTO X_10_DISTINCT_CI WITH (TABLOCK) (VAL)
SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
FROM dbo.GetNums(10000000);
UPDATE STATISTICS X_10_DISTINCT_CI WITH FULLSCAN;
The following query scans all 10 million rows from the table. How can I get something that doesn't scan the entire table? I'm using SQL Server 2016 SP1.
SELECT DISTINCT TOP 10 VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1);
Best Answer
There look to be three different optimizer rules that can perform the DISTINCT operation in the above query:

- GbAggToSort implements the group-by aggregate (distinct) as a distinct sort. This is a blocking operator that will read all of the data from the input before producing any rows.
- GbAggToStrm implements the group-by aggregate as a stream aggregate (which also requires an input sort in this instance). This is also a blocking operator.
- GbAggToHS implements it as a hash match, which is what we saw in the bad plan from the question. It can be implemented as a hash match (aggregate) or as a hash match (flow distinct).

A query that disables all three rules throws an error, which suggests that the list is exhaustive. The hash match (flow distinct) operator is one way to solve this problem because it is not blocking: SQL Server should be able to stop the scan once it finds enough distinct values.
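The rule list can be checked with the undocumented QUERYRULEOFF hint; a sketch of a query that disables all three (undocumented and unsupported, so for test systems only):

```sql
SELECT DISTINCT TOP 10 VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1
      , QUERYRULEOFF GbAggToSort
      , QUERYRULEOFF GbAggToStrm
      , QUERYRULEOFF GbAggToHS);
```

The error is 8622 ("Query processor could not produce a query plan because of the hints defined in this query").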
Why does the query in the question use a hash match (aggregate) instead of a hash match (flow distinct)? As the number of distinct values in the table changes, I would expect the cost of the hash match (flow distinct) query to decrease, because the estimate of the number of rows it needs to scan from the table should decrease. I would expect the cost of the hash match (aggregate) plan to increase because the hash table that it needs to build will get bigger. One way to investigate this is by creating a plan guide. If I create two copies of the data but apply a plan guide to one of them, I should be able to compare hash match (aggregate) to hash match (flow distinct) side by side against the same data. Note that I can't do this by disabling query optimizer rules because the same rule applies to both plans (GbAggToHS).

Here's one way to get the plan guide that I'm after:
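A sketch of the setup (the table name X_PLAN_GUIDE_TARGET is my choice for illustration): load a copy of the table with enough distinct values that the optimizer naturally chooses hash match (flow distinct), then run the query whose plan we want to capture:

```sql
DROP TABLE IF EXISTS X_PLAN_GUIDE_TARGET;
CREATE TABLE X_PLAN_GUIDE_TARGET (VAL VARCHAR(10) NOT NULL);

-- many distinct values, so the optimizer picks hash match (flow distinct)
INSERT INTO X_PLAN_GUIDE_TARGET WITH (TABLOCK)
SELECT CAST(N % 10000 AS VARCHAR(10))
FROM dbo.GetNums(10000000);

UPDATE STATISTICS X_PLAN_GUIDE_TARGET WITH FULLSCAN;

SELECT DISTINCT TOP 10 VAL
FROM X_PLAN_GUIDE_TARGET
OPTION (MAXDOP 1);
```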
Get the plan handle and use it to create a plan guide:
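For example (the guide name is my own, and the plan handle is machine-specific, so the value below is only a placeholder):

```sql
SELECT qs.plan_handle, st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%X[_]PLAN[_]GUIDE[_]TARGET%'
ORDER BY qs.last_execution_time DESC;

EXEC sp_create_plan_guide_from_handle
    @name = N'EVIL_PLAN_GUIDE',
    @plan_handle = 0x0600050026; -- placeholder: paste the handle returned above
```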
Plan guides only work on the exact query text, so let's copy it back from the plan guide:
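Something along these lines (again assuming the plan guide name from the previous step):

```sql
SELECT pg.query_text
FROM sys.plan_guides AS pg
WHERE pg.name = N'EVIL_PLAN_GUIDE';
```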
Reset the data:
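That is, reload X_PLAN_GUIDE_TARGET with the same ten blocked values as the heap in the question (a sketch, continuing the assumed names):

```sql
TRUNCATE TABLE X_PLAN_GUIDE_TARGET;

INSERT INTO X_PLAN_GUIDE_TARGET WITH (TABLOCK)
SELECT REPLICATE(CHAR(65 + (N / 100000) % 10), 10)
FROM dbo.GetNums(10000000);

UPDATE STATISTICS X_PLAN_GUIDE_TARGET WITH FULLSCAN;
```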
Get a query plan for the query with the plan guide applied:
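The query text must match the guide's stored query_text character for character; something like:

```sql
SELECT DISTINCT TOP 10 VAL
FROM X_PLAN_GUIDE_TARGET
OPTION (MAXDOP 1);
```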
This has the hash match (flow distinct) operator that we wanted with our test data. Note that SQL Server expects to read all of the rows from the table and that the estimated cost is exactly the same as for the plan with the hash match (aggregate). The testing that I did suggested that the costs for the two plans are identical when the row goal for the plan is greater than or equal to the number of distinct values SQL Server expects from the table, which in this case can be derived directly from the statistics. Unfortunately (for our query) the optimizer picks the hash match (aggregate) over the hash match (flow distinct) when the costs are the same. So we're 0.0000001 magic optimizer units away from the plan that we want.
One way to attack this problem is by decreasing the row goal. If the row goal from the optimizer's point of view is less than the distinct count of rows, we'll probably get hash match (flow distinct). This can be accomplished with the OPTIMIZE FOR query hint. The optimizer creates a plan as if the query just needs the first row, but when the query is executed it gets back the first 10 rows. On my machine this query scans 892800 rows from X_10_DISTINCT_HEAP and completes in 299 ms with 250 ms of CPU time and 2537 logical reads.

Note that this technique will not work if the statistics report just one distinct value, which could happen for sampled statistics against skewed data. However, in that case it's unlikely that your data is packed densely enough to justify using techniques like this. You may not lose much by scanning all of the data in the table, especially if that can be done in parallel.
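The hinted query can be sketched as follows; since OPTIMIZE FOR needs a variable, the TOP value is parameterized (a reconstruction, not necessarily the exact original text):

```sql
DECLARE @TOP INT = 10;

SELECT DISTINCT TOP (@TOP) VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1, OPTIMIZE FOR (@TOP = 1)); -- cost the plan with a row goal of 1
```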
Another way to attack this problem is by inflating the estimated number of distinct values SQL Server expects to get from the base table. This was harder than expected. Applying a deterministic function cannot possibly increase the distinct count of results, and the query optimizer appears to be aware of that mathematical fact (some testing suggests it is, at least for our purposes), so applying deterministic functions (which includes all string functions) will not increase the estimated number of distinct rows.
Many of the nondeterministic functions did not work either, including the obvious choices of NEWID() and RAND(). However, LAG() does the trick for this query. The query optimizer expects 10 million distinct values for the LAG expression, which encourages a hash match (flow distinct) plan. On my machine this query scans 892800 rows from X_10_DISTINCT_HEAP and completes in 1165 ms with 1109 ms of CPU time and 2537 logical reads, so the LAG() adds quite a bit of relative overhead.

@Paul White suggested trying batch mode processing for this query. On SQL Server 2016 we can get batch mode processing even with MAXDOP 1. One way to get batch mode processing for a rowstore table is to join to an empty CCI; that rewrite produces a batch mode query plan.
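Reconstructed sketches of those two queries (the exact original text isn't shown above; the NULL handling and temp table name are my assumptions). First, the plain rowstore LAG() query:

```sql
SELECT DISTINCT TOP 10 LAG(VAL, 0) OVER (ORDER BY (SELECT NULL)) AS VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1);
```

And the batch mode variant, joining to an empty clustered columnstore index (here a temp table, #BATCH_MODE_HELPER):

```sql
CREATE TABLE #BATCH_MODE_HELPER (I INT NOT NULL);
CREATE CLUSTERED COLUMNSTORE INDEX CCI ON #BATCH_MODE_HELPER;

SELECT DISTINCT TOP 10 q.VAL
FROM
(
    SELECT LAG(VAL, 1) OVER (ORDER BY (SELECT NULL)) AS VAL
    FROM X_10_DISTINCT_HEAP
    LEFT OUTER JOIN #BATCH_MODE_HELPER ON 1 = 0 -- never matches; only enables batch mode
) AS q
WHERE q.VAL IS NOT NULL -- discard the NULL that LAG(..., 1) emits for the first row
OPTION (MAXDOP 1);
```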
Paul pointed out that I had to change the query to use LAG(..., 1), because LAG(..., 0) doesn't appear to be eligible for the Window Aggregate optimization. This change reduced the elapsed time to 520 ms and the CPU time to 454 ms.

Note that the LAG() approach isn't the most stable one. If Microsoft changes the uniqueness assumption against the function then it may no longer work, and it produces a different estimate under the legacy CE. Also, this type of optimization against a heap isn't necessarily a good idea: if the table is rebuilt, it's possible to end up in a worst-case scenario in which almost all of the rows need to be read from the table.

Against a table with a unique column (such as the clustered index example in the question) we have better options. For example, we can trick the optimizer by using a SUBSTRING expression that always returns an empty string. SQL Server does not think that the SUBSTRING will change the number of distinct values, so if we apply it to a unique column, such as PK, then the estimated number of distinct rows is 10 million. The following query gets the hash match (flow distinct) operator; on my machine it scans 900000 rows from X_10_DISTINCT_CI and completes in 333 ms with 297 ms of CPU time and 3011 logical reads.

In summary, the query optimizer appears to assume that all rows will be read from the table for SELECT DISTINCT TOP N queries when N >= the estimated number of distinct rows from the table. The hash match (aggregate) operator may have the same cost as the hash match (flow distinct) operator, but the optimizer always picks the aggregate operator. This can lead to unnecessary logical reads when enough distinct values are located near the start of the table scan. Two ways to trick the optimizer into using the hash match (flow distinct) operator are to lower the row goal using the OPTIMIZE FOR hint, or to increase the estimated number of distinct rows using LAG() or SUBSTRING on a unique column.
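For reference, the SUBSTRING query against the clustered index table can be sketched as follows; the SUBSTRING starts past the end of the 10-character cast of PK, so it always returns an empty string while keying the distinct estimate to the unique column:

```sql
SELECT DISTINCT TOP 10
    VAL + SUBSTRING(CAST(PK AS VARCHAR(10)), 11, 1) AS VAL
FROM X_10_DISTINCT_CI
OPTION (MAXDOP 1);
```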