SQL Server – Does SQL Server ignore STATISTICS for varchar columns as this article suggests?

sql server

(I don't believe this question is a duplicate of this question from 8 years ago, as I'm not asking about the advantages of oversized columns; I'm asking about the behaviour demonstrated in the linked article below.)

This recent (2017) article from SQLPerformance.com demonstrates how varying the maximum length n of a varchar(n) column affects query-plan row-size estimates and sort-buffer size estimates, which can lead to subpar performance and memory-allocation warnings.
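
To make the setup concrete, the article's comparison boils down to something like the sketch below (the table and column names are my own placeholders, not the ones the article uses): identical, short e-mail strings stored in columns that differ only in their declared varchar length, then sorted.

    CREATE TABLE dbo.Emails_100  (id int IDENTITY PRIMARY KEY, email varchar(100)  NOT NULL);
    CREATE TABLE dbo.Emails_4000 (id int IDENTITY PRIMARY KEY, email varchar(4000) NOT NULL);

    -- Identical, short data in both tables.
    INSERT dbo.Emails_100  (email) SELECT 'someone@example.com' FROM sys.all_objects;
    INSERT dbo.Emails_4000 (email) SELECT 'someone@example.com' FROM sys.all_objects;

    -- Compare the Estimated Row Size and the Sort operator's memory grant
    -- in the actual execution plans of these two otherwise-identical queries.
    SELECT email FROM dbo.Emails_100  ORDER BY email;
    SELECT email FROM dbo.Emails_4000 ORDER BY email;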

In it, the author claims (emphasis mine):

From here we see that, the bigger the column definition, the higher the estimated row and data size. In this simple query, the I/O cost (0.0512731) is the same across all of the queries, regardless of definition, because the clustered index scan has to read all of the data anyway.

But there are other scenarios where this estimated row and total data size will have an impact: operations that require additional resources, such as sorts.

When I read that claim (in bold) I was surprised, because I thought that SQL Server would get fairly accurate row-size estimates from the sampled STATISTICS objects it maintains on those same tables – especially given that the SELECT AVG(LEN(email)) query in the article shows that no column has a value exceeding 77 characters.
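
The kind of check I have in mind is simply something like this (placeholder names, not the article's schema) – the stored data is nowhere near the declared maximum:

    SELECT AVG(LEN(email)) AS avg_len,
           MAX(LEN(email)) AS max_len
    FROM   dbo.Emails_4000;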

The article also explicitly performs an ALTER INDEX ALL ON dbo.Table REBUILD – which this DB.SE posting says will also automatically update STATISTICS.
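
As a sanity check of my own (not something the article does), the time of the last statistics update – and how many rows were sampled – can be read with something like:

    SELECT s.name           AS stats_name,
           sp.last_updated,
           sp.rows,
           sp.rows_sampled
    FROM   sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE  s.object_id = OBJECT_ID(N'dbo.Emails_4000');   -- placeholder table name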

(Though I'm surprised that the word "statistics" doesn't appear anywhere in the SQLPerformance article – so maybe in the author's case the statistics weren't updated due to some machine configuration and they didn't notice?)

Does SQL Server use only the varchar column's declared length limit for row-size estimates? If not, why does the SQLPerformance article describe that behaviour?

Best Answer

Does SQL Server use only the varchar column's declared length limit for row-size estimates? If not, why does the SQLPerformance article describe that behaviour?

Correct. SQL Server uses only the declared varchar length (the specified maximum) when estimating row size. The SQLPerformance article describes that behaviour accurately.
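
One way to see this directly is to look at the AvgRowSize attribute on each operator in the estimated plan XML; it tracks the declared column widths, not the lengths recorded in any statistics object. A minimal sketch, reusing the placeholder table from the question:

    SET SHOWPLAN_XML ON;
    GO
    -- Returns the estimated plan as XML instead of executing the query.
    SELECT email FROM dbo.Emails_4000 ORDER BY email;
    GO
    SET SHOWPLAN_XML OFF;
    GO
    -- In the returned plan XML, check the AvgRowSize attribute on the RelOp elements:
    -- for a varchar(4000) column it reflects roughly half the declared maximum plus
    -- row overhead, regardless of how short the stored values actually are.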

The longer answer

In his example in the linked article, Aaron rebuilds all indexes to ensure that every version of the query has a level playing field with respect to both index size and statistics, so that the execution plans for all cases are "ideal" and (as the experiment showed) almost equal, but not quite.

Statistics are used to estimate how many rows will be returned, not how much memory is granted for the execution of a query.
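
The memory grant itself can be observed separately from the row estimates – for example, while a query is running (a sketch; querying this DMV requires VIEW SERVER STATE):

    -- Memory requested, granted and actually used per executing query.
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb
    FROM   sys.dm_exec_query_memory_grants;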

In the article, Aaron says (emphasis mine):

SQL Server has to take into account how much data might be in a column. It doesn't do this based on actually profiling the data, and it can't make any assumptions based on the <= 201 histogram step values. Instead, it has to estimate that every row holds a value half of the declared column size. So for a varchar(4000), it assumes every e-mail address is 2,000 characters long.

Aaron's reference to "the histogram step values" means the statistics histogram. The statistics histogram contains knowledge of at most 201 data values from the table. It knows the actual length of those (up to 201) explicit values, but it has no idea about the values in between them.
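
That histogram can be inspected directly – for example (the statistics/index name here is only illustrative):

    -- Shows the (at most 201) histogram steps the optimizer actually knows about.
    DBCC SHOW_STATISTICS (N'dbo.Emails_4000', N'IX_Emails_4000_email') WITH HISTOGRAM;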

Additionally, statistics are based on a sample of the data, so there may be rows that were not analyzed as part of the sample; relying on the min/max/average length of data from stats would be another opportunity for outdated or unrepresentative samples to adversely affect query execution.