SQL Server – Why does the SELECT DISTINCT TOP N query scan the entire table?

optimization, sql-server, sql-server-2016

I've run into a few SELECT DISTINCT TOP N queries which appear to be poorly optimized by the SQL Server query optimizer. Let's start by considering a trivial example: a million row table with two alternating values. I'll use the GetNums function to generate the data:
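The examples below assume Itzik Ben-Gan's dbo.GetNums virtual numbers function. If you don't have it, a minimal sketch in the same style works for these tests (the exact signature of the published versions differs; this one just takes a row count):

```sql
-- Minimal numbers function in the style of Itzik Ben-Gan's GetNums.
-- Cross-joined CTEs generate rows without touching any base table.
CREATE OR ALTER FUNCTION dbo.GetNums (@n AS BIGINT) RETURNS TABLE
AS
RETURN
    WITH
    L0 AS (SELECT 1 AS c UNION ALL SELECT 1),               -- 2 rows
    L1 AS (SELECT 1 AS c FROM L0 AS a CROSS JOIN L0 AS b),  -- 4 rows
    L2 AS (SELECT 1 AS c FROM L1 AS a CROSS JOIN L1 AS b),  -- 16 rows
    L3 AS (SELECT 1 AS c FROM L2 AS a CROSS JOIN L2 AS b),  -- 256 rows
    L4 AS (SELECT 1 AS c FROM L3 AS a CROSS JOIN L3 AS b),  -- 65,536 rows
    L5 AS (SELECT 1 AS c FROM L4 AS a CROSS JOIN L4 AS b),  -- ~4.3 billion rows
    Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM L5)
    SELECT TOP (@n) N
    FROM Nums
    ORDER BY N;
```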

DROP TABLE IF EXISTS X_2_DISTINCT_VALUES;

CREATE TABLE X_2_DISTINCT_VALUES (PK INT IDENTITY (1, 1), VAL INT NOT NULL);

INSERT INTO X_2_DISTINCT_VALUES WITH (TABLOCK) (VAL)
SELECT N % 2
FROM dbo.GetNums(1000000);

UPDATE STATISTICS X_2_DISTINCT_VALUES WITH FULLSCAN;

For the following query:

SELECT DISTINCT TOP 2 VAL
FROM X_2_DISTINCT_VALUES
OPTION (MAXDOP 1);

SQL Server can find two distinct values just by scanning the first data page of the table but it scans all of the data instead. Why doesn't SQL Server just scan until it finds the requested number of distinct values?

For this question please use the following test data that contains 10 million rows with 10 distinct values generated in blocks:

DROP TABLE IF EXISTS X_10_DISTINCT_HEAP;

CREATE TABLE X_10_DISTINCT_HEAP (VAL VARCHAR(10) NOT NULL);

INSERT INTO X_10_DISTINCT_HEAP WITH (TABLOCK)
SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
FROM dbo.GetNums(10000000);

UPDATE STATISTICS X_10_DISTINCT_HEAP WITH FULLSCAN;

Answers for a table with a clustered index are also acceptable:

DROP TABLE IF EXISTS X_10_DISTINCT_CI;

CREATE TABLE X_10_DISTINCT_CI (PK INT IDENTITY (1, 1), VAL VARCHAR(10) NOT NULL, PRIMARY KEY (PK));

INSERT INTO X_10_DISTINCT_CI WITH (TABLOCK) (VAL)
SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
FROM dbo.GetNums(10000000);

UPDATE STATISTICS X_10_DISTINCT_CI WITH FULLSCAN;

The following query scans all 10 million rows from the table. How can I get something that doesn't scan the entire table? I'm using SQL Server 2016 SP1.

SELECT DISTINCT TOP 10 VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1);
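The row, timing, and logical read figures quoted later were gathered with standard session statistics; something like the following reproduces that measurement (your numbers will vary with hardware and caching):

```sql
-- Capture logical reads and CPU/elapsed time for the problem query.
SET STATISTICS IO, TIME ON;

SELECT DISTINCT TOP 10 VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1);

SET STATISTICS IO, TIME OFF;
```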

Best Answer

There appear to be three different optimizer rules that can perform the DISTINCT operation in the above query. The following query throws an error, which suggests that the list is exhaustive:

SELECT DISTINCT TOP 10 VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1, QUERYRULEOFF GbAggToSort, QUERYRULEOFF GbAggToHS, QUERYRULEOFF GbAggToStrm);

Msg 8622, Level 16, State 1, Line 1

Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

GbAggToSort implements the group-by aggregate (distinct) as a distinct sort. This is a blocking operator that reads all of the data from the input before producing any rows. GbAggToStrm implements the group-by aggregate as a stream aggregate (which also requires an input sort in this instance). This is also a blocking operator. GbAggToHS implements the group-by aggregate as a hash match, which is what we saw in the bad plan from the question. It can be implemented as either a hash match (aggregate) or a hash match (flow distinct).

The hash match (flow distinct) operator is one way to solve this problem because it is not blocking. SQL Server should be able to stop the scan once it finds enough distinct values.

The Flow Distinct logical operator scans the input, removing duplicates. Whereas the Distinct operator consumes all input before producing any output, the Flow Distinct operator returns each row as it is obtained from the input (unless that row is a duplicate, in which case it is discarded).

Why does the query in the question use hash match (aggregate) instead of hash match (flow distinct)? As the number of distinct values in the table changes I would expect the cost of the hash match (flow distinct) query to decrease, because the estimate of the number of rows it needs to scan from the table should decrease. I would expect the cost of the hash match (aggregate) plan to increase because the hash table that it needs to build will get bigger. One way to investigate this is by creating a plan guide. If I create two copies of the data but apply a plan guide to one of them, I should be able to compare hash match (aggregate) to hash match (flow distinct) side by side against the same data. Note that I can't do this by disabling query optimizer rules because the same rule applies to both plans (GbAggToHS).

Here's one way to get the plan guide that I'm after:

DROP TABLE IF EXISTS X_PLAN_GUIDE_TARGET;

CREATE TABLE X_PLAN_GUIDE_TARGET (VAL VARCHAR(10) NOT NULL);

INSERT INTO X_PLAN_GUIDE_TARGET WITH (TABLOCK)
SELECT CAST(N % 10000 AS VARCHAR(10))
FROM dbo.GetNums(10000000);

UPDATE STATISTICS X_PLAN_GUIDE_TARGET WITH FULLSCAN;

-- run this query
SELECT DISTINCT TOP 10 VAL  FROM X_PLAN_GUIDE_TARGET  OPTION (MAXDOP 1)

Get the plan handle and use it to create a plan guide:

-- plan handle is 0x060007009014BC025097E88F6C01000001000000000000000000000000000000000000000000000000000000
SELECT qs.plan_handle, st.text FROM 
sys.dm_exec_query_stats AS qs   
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st  
WHERE st.text LIKE '%X[_]PLAN[_]GUIDE[_]TARGET%'
ORDER BY last_execution_time DESC;

EXEC sp_create_plan_guide_from_handle 
'EVIL_PLAN_GUIDE', 
0x060007009014BC025097E88F6C01000001000000000000000000000000000000000000000000000000000000;

Plan guides only work on the exact query text, so let's copy it back from the plan guide:

SELECT query_text
FROM sys.plan_guides
WHERE name = 'EVIL_PLAN_GUIDE';

Reset the data:

TRUNCATE TABLE X_PLAN_GUIDE_TARGET;

INSERT INTO X_PLAN_GUIDE_TARGET WITH (TABLOCK)
SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
FROM dbo.GetNums(10000000);

Get a query plan for the query with the plan guide applied:

SELECT DISTINCT TOP 10 VAL  FROM X_PLAN_GUIDE_TARGET  OPTION (MAXDOP 1)

This has the hash match (flow distinct) operator that we wanted with our test data. Note that SQL Server expects to read all of the rows from the table and that the estimated cost is exactly the same as for the plan with the hash match (aggregate). The testing that I did suggested that the costs for the two plans are identical when the row goal for the plan is greater than or equal to the number of distinct values SQL Server expects from the table, which in this case can be derived directly from the statistics. Unfortunately (for our query) the optimizer picks the hash match (aggregate) over the hash match (flow distinct) when the costs are the same. So we're 0.0000001 magic optimizer units away from the plan that we want.

One way to attack this problem is by decreasing the row goal. If the row goal from the optimizer's point of view is less than the distinct count of rows we'll probably get hash match (flow distinct). This can be accomplished with the OPTIMIZE FOR query hint:

DECLARE @j INT = 10;

SELECT DISTINCT TOP (@j) VAL
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1, OPTIMIZE FOR (@j = 1));

For this query the optimizer creates a plan as if the query only needs the first row, but when the query is executed it returns the first 10 rows. On my machine this query scans 892800 rows from X_10_DISTINCT_HEAP and completes in 299 ms with 250 ms of CPU time and 2537 logical reads.

Note that this technique will not work if the statistics report just one distinct value, which could happen for sampled statistics against skewed data. However, in that case it's unlikely that your data is packed densely enough to justify using techniques like this. You may not lose much by scanning all of the data in the table, especially if that can be done in parallel.
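Before relying on this trick, you can inspect what distinct count the optimizer will derive from the statistics. A hedged sketch (the statistics name S_VAL is one I've made up here; auto-created statistics get system-generated _WA_Sys_ names instead):

```sql
-- Create a named statistics object on VAL so it is easy to inspect,
-- then look at the density vector. "All density" is approximately
-- 1 / (number of distinct values), so 10 distinct values should show
-- a density of about 0.1.
CREATE STATISTICS S_VAL ON dbo.X_10_DISTINCT_HEAP (VAL) WITH FULLSCAN;

DBCC SHOW_STATISTICS ('dbo.X_10_DISTINCT_HEAP', S_VAL) WITH DENSITY_VECTOR;
```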

Another way to attack this problem is by inflating the number of distinct values SQL Server expects to get from the base table. This was harder than expected. Applying a deterministic function cannot possibly increase the distinct count of results. If the query optimizer is aware of that mathematical fact (some testing suggests it is, at least for our purposes) then applying deterministic functions (which includes all string functions) will not increase the estimated number of distinct rows.

Many of the nondeterministic functions did not work either, including the obvious choices of NEWID() and RAND(). However, LAG() does the trick for this query. The query optimizer expects 10 million distinct values for the LAG expression, which encourages a hash match (flow distinct) plan:

SELECT DISTINCT TOP 10 LAG(VAL, 0) OVER (ORDER BY (SELECT NULL)) AS ID
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1);

On my machine this query scans 892800 rows from X_10_DISTINCT_HEAP and completes in 1165 ms with 1109 ms of CPU time and 2537 logical reads, so the LAG() adds quite a bit of relative overhead. @Paul White suggested trying batch mode processing for this query. On SQL Server 2016 we can get batch mode processing even with MAXDOP 1. One way to get batch mode processing for a rowstore table is to join to an empty clustered columnstore index (CCI) as follows:

CREATE TABLE #X_DUMMY_CCI (ID INT NOT NULL);

CREATE CLUSTERED COLUMNSTORE INDEX X_DUMMY_CCI ON #X_DUMMY_CCI;

SELECT DISTINCT TOP 10 VAL
FROM
(
    SELECT LAG(VAL, 1) OVER (ORDER BY (SELECT NULL)) AS VAL
    FROM X_10_DISTINCT_HEAP
    LEFT OUTER JOIN #X_DUMMY_CCI ON 1 = 0
) t
WHERE t.VAL IS NOT NULL
OPTION (MAXDOP 1);

That code results in a query plan that uses batch mode processing.

Paul pointed out that I had to change the query to use LAG(..., 1) because LAG(..., 0) doesn't appear to be eligible for the Window Aggregate optimization. This change reduced the elapsed time to 520 ms and the CPU time to 454 ms.

Note that the LAG() approach isn't the most stable one. If Microsoft changes the uniqueness assumption for the function then it may no longer work. It also produces a different estimate with the legacy CE. In addition, this type of optimization against a heap isn't necessarily a good idea: if the table is rebuilt, it's possible to end up in a worst-case scenario in which almost all of the rows need to be read from the table.
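The legacy CE difference is easy to observe directly. On SQL Server 2016 SP1 the cardinality estimation model can be switched per query with USE HINT (the hint name is real; whether it changes the chosen plan on your build is something to verify against the actual estimates):

```sql
-- Same LAG() rewrite, but costed under the legacy cardinality estimator.
-- Compare the estimated distinct count and chosen operator in this plan
-- to the plan produced under the default CE.
SELECT DISTINCT TOP 10 LAG(VAL, 0) OVER (ORDER BY (SELECT NULL)) AS ID
FROM X_10_DISTINCT_HEAP
OPTION (MAXDOP 1, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```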

Against a table with a unique column (such as the clustered index example in the question) we have better options. For example, we can trick the optimizer by using a SUBSTRING expression that always returns an empty string. SQL Server does not think that the SUBSTRING will change the number of distinct values, so if we apply it to a unique column, such as PK, then the estimated number of distinct rows is 10 million. The following query gets the hash match (flow distinct) operator:

SELECT DISTINCT TOP 10 VAL + SUBSTRING(CAST(PK AS VARCHAR(10)), 11, 1)
FROM X_10_DISTINCT_CI
OPTION (MAXDOP 1);

On my machine this query scans 900000 rows from X_10_DISTINCT_CI and completes in 333 ms with 297 ms of CPU time and 3011 logical reads.

In summary, the query optimizer appears to assume that all rows will be read from the table for SELECT DISTINCT TOP N queries when N >= the number of estimated distinct rows from the table. The hash match (aggregate) operator may have the same cost as the hash match (flow distinct) operator but the optimizer always picks the aggregate operator. This can lead to unnecessary logical reads when enough distinct values are located near the start of the table scan. Two ways to trick the optimizer into using the hash match (flow distinct) operator are to lower the row goal using the OPTIMIZE FOR hint or to increase the estimated number of distinct rows using LAG() or SUBSTRING on a unique column.