Sql-server – Statistics histogram AVG_RANGE_ROWS discrepancy

sql serversql server 2014statistics

According to MS docs, the description for AVG_RANGE_ROWS is:

Average number of rows with duplicate column values within a histogram step, excluding the upper bound. When DISTINCT_RANGE_ROWS is greater than 0, AVG_RANGE_ROWS is calculated by dividing RANGE_ROWS by DISTINCT_RANGE_ROWS. When DISTINCT_RANGE_ROWS is 0, AVG_RANGE_ROWS returns 1 for the histogram step.

I'm looking at the last line and if that is indeed the case, I'm curious to know why I'm seeing a value for AVG_RANGE_ROWS that is not equals 1 when DISTINCT_RANGE_ROWS is 0 in the histogram steps.

The statistic in question is a column stat created by SQL Server when the auto create statistics option is turned on. I'm on an older version of the database, but on the latest patch – SQL Server 2014 SP3, CU4+GDR (12.0.6372.1).

It is a bit unfortunate that we almost had a meltdown last week because of a sub-optimal query plan. The end result result was large scans and bloated memory grants. Resampling the statistic with a higher percent value sort of resolved the issue for us for the time being but I'm mostly curious to know if there are exceptions around the initial statement or a known issue (perhaps addressed using a trace flag?) and how do I prevent this from happening again for auto created stats where we have no control over the sampling size?

Best Answer

As described in answer to Badly-formed histogram causes bad estimates on Nested Loop there has been a change to the way sampled statistics are calculated and stored, more particularly when scaling is applied.

As a side-effect, the value for DISTINCT_RANGE_ROWS in your case is a fraction between 0 and 1 (980.235 / 386212.6 = 0.002538071). The column has an exposed type of bigint, so it is rounded down to zero.

Clearly there can't really be zero distinct values when the range contains a non-zero numbers of rows.

One can only hope these discrepancies are cleared up at some stage; though it's hard to imagine exactly what that might look like without a potentially breaking change of data type, extending also to sys.dm_db_stats_histogram (available on SQL Server 2016 and later).

As for what you do about it, if you are convinced this is not just a display issue and is actually causing poor estimations, you should report it as a regression.

Best Answer

Related Solutions

Sql-server – Multicolumn statistics on large table yielding poor estimations

Sql-server – Clustered index scan appears to be costed too low with row count spool

Related Question