SQL Server 2014 COUNT(DISTINCT x) ignores statistics density vector for column x

cardinality-estimates, sql-server, sql-server-2014

For a COUNT(DISTINCT) over a column with ~1 billion distinct values, I'm getting a query plan with a hash aggregate estimated to produce only ~3 million rows.

Why is this happening? SQL Server 2012 produces a good estimate, so is this a bug in SQL Server 2014 that I should report on Connect?

The query and poor estimate

-- Actual rows: 1,011,719,166
-- SQL 2012 estimated rows: 1,079,130,000 (106% of actual)
-- SQL 2014 estimated rows: 2,980,240 (0.29% of actual)
SELECT COUNT(DISTINCT factCol5)
FROM BigFactTable
OPTION (RECOMPILE, QUERYTRACEON 9481) -- Include this line to use SQL 2012 CE

-- Stats for the factCol5 column show that there are ~1 billion distinct values
-- This is a good estimate, and it appears to be what the SQL 2012 CE uses
DBCC SHOW_STATISTICS (BigFactTable, _WA_Sys_00000005_24927208)
--All density   Average Length  Columns
--9.266754E-10  8               factCol5
SELECT 1 / 9.266754E-10
-- 1079126520.46229

The query plan

(screenshot: the estimated query plan)

Full script

Here is a full repro of the situation using a stats-only database.

What I've tried so far

I dug into the statistics for the relevant column and found that the density vector shows an estimated ~1.1 billion distinct values. SQL Server 2012 uses this estimate and produces a good plan. SQL Server 2014, surprisingly, appears to ignore the very accurate estimate provided by the statistics and instead uses a much lower estimate. This produces a much slower plan that does not reserve nearly enough memory and spills to tempdb.

I tried trace flag 4199, but that did not fix the situation. Lastly, I tried to dig into the optimizer information via a combination of trace flags (3604, 8606, 8607, 8608, 8612), as demonstrated in the second half of this article and sketched below. However, I wasn't able to see any information explaining the bad estimate until it appeared in the final output tree.
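For reference, a minimal sketch of how those undocumented trace flags can be combined on the query itself (the flag descriptions in the comments reflect common community usage, not official documentation):

-- Dump optimizer tree information for this query to the Messages tab
SELECT COUNT(DISTINCT factCol5)
FROM BigFactTable
OPTION
(
    RECOMPILE,
    QUERYTRACEON 3604, -- redirect trace output to the client
    QUERYTRACEON 8606, -- show logical trees at each stage
    QUERYTRACEON 8607, -- show the optimizer output tree
    QUERYTRACEON 8608, -- show the initial memo
    QUERYTRACEON 8612  -- add cardinality and cost information
);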

Connect issue

Based on the answers to this question, I have also filed this as an issue on Connect.

Best Answer

The way the cardinality estimation is derived certainly seems counter-intuitive to me. The distinct count calculation (viewable with Extended Events or trace flags 2363 and 3604) is:

(screenshot: the optimizer's distinct count calculation)
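As a sketch, that calculation trace can be reproduced for this query with the trace flags mentioned above (both undocumented); the output appears on the Messages tab:

SELECT COUNT(DISTINCT factCol5)
FROM BigFactTable
OPTION (RECOMPILE, QUERYTRACEON 2363, QUERYTRACEON 3604);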

Notice the cap. The general logic of this seems very reasonable (a column cannot have more distinct values than the multi-column combination it belongs to), but the cap is applied from sampled multi-column statistics:

DBCC SHOW_STATISTICS 
    (BigFactTable, [PK_BigFactTable])
WITH
    STAT_HEADER, 
    DENSITY_VECTOR;

(screenshot: DBCC SHOW_STATISTICS output for PK_BigFactTable)

That shows 2,980,235 rows sampled out of 3,439,431,721, with a density at the Col5 level of 3.35544E-07. The reciprocal of that density gives roughly 2,980,235 distinct values, rounded using real math to the 2,980,240 seen in the estimate.
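Mirroring the reciprocal check from the question, using the density value reported above:

SELECT 1 / 3.35544E-07
-- ~2980235 (the cap before rounding)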

The question now is what assumptions the model should make about the number of distinct values, given sampled statistics. I would expect it to extrapolate, but that isn't done, perhaps deliberately.

More intuitively, I would expect that instead of using the multi-column statistics, it would look at the density on Col5 (but it doesn't):

DBCC SHOW_STATISTICS 
    (BigFactTable, [_WA_Sys_00000005_24927208])
WITH
    STAT_HEADER, 
    DENSITY_VECTOR;

(screenshot: DBCC SHOW_STATISTICS output for _WA_Sys_00000005_24927208)

Here the density is 9.266754E-10, the reciprocal of which is 1,079,126,528.

One obvious workaround in the meantime is to update the multi-column statistics with a full scan. The other is to use the original (pre-2014) cardinality estimator. Both are sketched below.
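A minimal sketch of both workarounds, using the object names from the question (the trace flag hint is the same one shown in the original query):

-- Workaround 1: rebuild the multi-column statistics with a full scan,
-- so the density at the Col5 level reflects all rows rather than a sample
UPDATE STATISTICS BigFactTable (PK_BigFactTable) WITH FULLSCAN;

-- Workaround 2: fall back to the pre-2014 cardinality estimator for this query only
SELECT COUNT(DISTINCT factCol5)
FROM BigFactTable
OPTION (RECOMPILE, QUERYTRACEON 9481);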

The Connect item you opened, "SQL 2014 sampled multi-column statistics override more accurate single-column statistics for non-leading columns", is marked Fixed for SQL Server 2017.