SQL Server – Threshold for sort spill and random cardinality estimates

optimization, query-performance, sorting, sql-server, sql-server-2014

I wanted to test tempdb spill warnings, so I ran the following script on SQL Server 2014:

USE tempdb

IF OBJECT_ID('tempdb..tblTest') IS NOT NULL DROP TABLE tblTest

CREATE TABLE tblTest
(
          c1 INT         PRIMARY KEY CLUSTERED,
          c2 INT        ,
          c3 CHAR (1000)
);

GO
SET NOCOUNT ON;

BEGIN TRANSACTION;

DECLARE @i AS INT;

SET @i = 1;

WHILE @i <= 10000
          BEGIN
                    INSERT  INTO tblTest (c1, c2, c3)
                    VALUES              (@i, @i, 'a');
                    SET @i = @i + 1;
          END

COMMIT TRANSACTION;

GO
UPDATE STATISTICS dbo.tblTest
GO
SET STATISTICS XML ON;
GO
--no tempdb spill (SQL Server 2014)
--in SQL Server 2012 the "Estimated number of rows" is different every time you run the whole script
SELECT   *
FROM     tblTest
WHERE    c1 <= 5948
ORDER BY c2
OPTION (MAXDOP 1);

GO
SET STATISTICS XML OFF;

SET STATISTICS XML ON;
GO
--tempdb spill (SQL Server 2014)
SELECT   *
FROM     tblTest
WHERE    c1 <= 5949
ORDER BY c2
OPTION (MAXDOP 1);

GO
SET STATISTICS XML OFF;

(the core of the query was based on material from Microsoft)

1) The first thing I am curious about is what causes the tempdb spill during the sort operation at this specific value of the c1 column. All the estimates are correct and the number of pages read is the same for both queries (see the STATISTICS IO snippet after these questions), so why does the second query spill? (In other words, why is the memory grant for the latter query so much higher?)

2) I have tested this query on SQL Server 2012 and got very interesting behaviour as well. At first I was unable to hit the same thresholds, so I ran the script repeatedly and noticed that the estimated number of rows was always different from the previous run. My question is: why is the estimated number of returned rows different every time I run the very same script (which creates the table, inserts the rows, and updates its own statistics via a full scan)?
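
(For reference: the page-count comparison mentioned in question 1 can be done with STATISTICS IO, which reports logical reads per statement. This snippet is not part of the original script.)

SET STATISTICS IO ON;
GO
SELECT * FROM tblTest WHERE c1 <= 5948 ORDER BY c2 OPTION (MAXDOP 1);
SELECT * FROM tblTest WHERE c1 <= 5949 ORDER BY c2 OPTION (MAXDOP 1);
GO
SET STATISTICS IO OFF;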

Best Answer

  1. The amount of memory needed to perform a sort is not as simple as computing the raw size of the input data. The main sort algorithm used by SQL Server is an internal variation on merge sort, and the implementation includes extra steps such as key normalization so that all combinations of column data types can be sorted efficiently. Because of these extra steps, it is not easy to predict in advance exactly how much memory is needed to avoid a spill at runtime; the memory grant is an estimate. Spilling is part of the design, and you should not be overly concerned by small spills in edge cases like this. (A sketch of how to compare the two queries' compiled grants from the plan cache appears after this list.)

  2. SQL Server 2014 includes a new cardinality estimation module, which is used if the context database is compatibility level 120. You use tempdb for your test, which will be 120 level by default. You can get the pre-2014 CE behaviour by changing the compatibility level, or using trace flag 9481. Regarding the different number of estimated rows, the statistics may be sampled unless you use the WITH FULLSCAN option with UPDATE STATISTICS.