SQL Server – Does SQL Server ignore STATISTICS for varchar columns as this article suggests?

sql server

(I don't believe this question is a duplicate of this question from 8 years ago, as I'm not asking about the advantages of oversized columns; I'm asking about the behaviour demonstrated in the linked article below.)

This recent (2017) article from SQLPerformance.com demonstrates how varying the maximum length n of a varchar(n) column affects query-plan row-size estimates and sort-buffer size estimates, which can lead to subpar performance and memory-allocation warnings.
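
To make the setup concrete, the article's comparison boils down to something like the sketch below (the table and column names are my own placeholders, not the ones the article uses): identical, short e-mail strings stored in columns that differ only in their declared varchar length, then sorted.

    CREATE TABLE dbo.Emails_100  (id int IDENTITY PRIMARY KEY, email varchar(100)  NOT NULL);
    CREATE TABLE dbo.Emails_4000 (id int IDENTITY PRIMARY KEY, email varchar(4000) NOT NULL);

    -- Identical, short data in both tables.
    INSERT dbo.Emails_100  (email) SELECT 'someone@example.com' FROM sys.all_objects;
    INSERT dbo.Emails_4000 (email) SELECT 'someone@example.com' FROM sys.all_objects;

    -- Compare the Estimated Row Size and the Sort operator's memory grant
    -- in the actual execution plans of these two otherwise-identical queries.
    SELECT email FROM dbo.Emails_100  ORDER BY email;
    SELECT email FROM dbo.Emails_4000 ORDER BY email;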

In it, the author claims (emphasis mine):

From here we see that, the bigger the column definition, the higher the estimated row and data size. In this simple query, the I/O cost (0.0512731) is the same across all of the queries, regardless of definition, because the clustered index scan has to read all of the data anyway.

But there are other scenarios where this estimated row and total data size will have an impact: operations that require additional resources, such as sorts.

When I read that claim (in bold) I was surprised, because I thought that SQL Server would get fairly accurate row-size estimates from the sampled STATISTICS objects it maintains on those same tables – especially given that the SELECT AVG(LEN(email)) query in the article shows that no column has a value exceeding 77 characters.
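
The kind of check I have in mind is simply something like this (placeholder names, not the article's schema) – the stored data is nowhere near the declared maximum:

    SELECT AVG(LEN(email)) AS avg_len,
           MAX(LEN(email)) AS max_len
    FROM   dbo.Emails_4000;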

The article also explicitly performs an ALTER INDEX ALL ON dbo.Table REBUILD – which this DB.SE posting says will also automatically update STATISTICS.
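
As a sanity check of my own (not something the article does), the time of the last statistics update – and how many rows were sampled – can be read with something like:

    SELECT s.name           AS stats_name,
           sp.last_updated,
           sp.rows,
           sp.rows_sampled
    FROM   sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE  s.object_id = OBJECT_ID(N'dbo.Emails_4000');   -- placeholder table name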

(Though I'm surprised that the word "statistics" doesn't appear anywhere in the SQLPerformance article – so maybe in the author's case the statistics weren't updated due to some machine configuration and they didn't notice?)

Does SQL Server use only the varchar column's declared length limit for row-size estimates? If not, why does the SQLPerformance article describe that behaviour?

Best Answer

Does SQL Server use only the varchar column's declared length limit for row-size estimates? If not, why does the SQLPerformance article describe that behaviour?

Correct. SQL Server uses only the declared varchar length (the specified maximum) when estimating row size. The SQLPerformance article describes that behaviour accurately.
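
One way to see this directly is to look at the AvgRowSize attribute on each operator in the estimated plan XML; it tracks the declared column widths, not the lengths recorded in any statistics object. A minimal sketch, reusing the placeholder table from the question:

    SET SHOWPLAN_XML ON;
    GO
    -- Returns the estimated plan as XML instead of executing the query.
    SELECT email FROM dbo.Emails_4000 ORDER BY email;
    GO
    SET SHOWPLAN_XML OFF;
    GO
    -- In the returned plan XML, check the AvgRowSize attribute on the RelOp elements:
    -- for a varchar(4000) column it reflects roughly half the declared maximum plus
    -- row overhead, regardless of how short the stored values actually are.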

The longer answer

In his example in the linked article, Aaron rebuilds all indexes to ensure that every version of the query has a level playing field with respect to both index size and statistics, so that the execution plans for all cases are "ideal" and (as the experiment showed) almost equal, but not quite.

Statistics are used to estimate how many rows will be returned, not how much memory is granted for the execution of a query.
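
The memory grant itself can be observed separately from the row estimates – for example, while a query is running (a sketch; querying this DMV requires VIEW SERVER STATE):

    -- Memory requested, granted and actually used per executing query.
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb
    FROM   sys.dm_exec_query_memory_grants;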

In the article, Aaron says (emphasis mine):

SQL Server has to take into account how much data might be in a column. It doesn't do this based on actually profiling the data, and it can't make any assumptions based on the <= 201 histogram step values. Instead, it has to estimate that every row holds a value half of the declared column size. So for a varchar(4000), it assumes every e-mail address is 2,000 characters long.

Aaron's reference to "the histogram step values" means the statistics histogram. The statistics histogram contains knowledge of at most 201 data values from the table. It knows the actual length of those (up to 201) explicit values, but it has no idea about the values in between them.
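
That histogram can be inspected directly – for example (the statistics/index name here is only illustrative):

    -- Shows the (at most 201) histogram steps the optimizer actually knows about.
    DBCC SHOW_STATISTICS (N'dbo.Emails_4000', N'IX_Emails_4000_email') WITH HISTOGRAM;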

Additionally, statistics are based on a sample of the data, so there may be rows that were not analyzed as part of the sample; relying on the min/max/average length of data from stats would be another opportunity for outdated or unrepresentative samples to adversely affect query execution.