Sql-server – SQL Server sample Update of Statistics misses highest RANGE_HI_KEY on ascending key column

sql serversql-server-2012statistics

I'm trying to understand how Statistics sampling works and whether or not the below is expected behaviour on sampled statistics updates.

We have a large table partitioned by date with a couple of billion rows. The partition date is the prior business date and so is an ascending key. We only load data into this table for the prior day.

The data load runs overnight, so on Friday 8th April we loaded data for the 7th.

After each run we update statistics, although take a sample, rather than a FULLSCAN.

Maybe I am being naïve, but I would have expected SQL Server identify the highest key and lowest key in the range to ensure it got an accurate range sample. According to this article:

For the first bucket, the lower boundary is the smallest value of the column on which the histogram is generated.

However, it doesn't mention the last bucket/largest value.

With the sampled Statistics update on the morning of the 8th, the sample missed the highest value in the table (the 7th).

enter image description here

As we do a lot of querying on data from the prior day, this resulted in inaccurate cardinality estimation and a number of queries timing out.

Should SQL Server not identify the highest value for that key and use that as the maximum RANGE_HI_KEY? Or is this just one of the limits of update without using FULLSCAN?

Version SQL Server 2012 SP2-CU7. We cannot currently upgrade due to a change in OPENQUERY behaviour in SP3 that was rounding down numbers in a linked server query between SQL Server and Oracle.

Best Answer

Should SQL Server not identify the highest value for that key and use that as the maximum RANGE_HI_KEY? Or is this just one of the limits of update without using FULLSCAN?

It's a limitation of the current implementation of sampled statistics. As it stands, sampled statistics collection uses TABLESAMPLE SYSTEM, which uses an allocation-order scan and chooses pages from the scan to sample. Only chosen pages contribute to the histogram.

Since the scan is allocation-ordered (rather than index-ordered), there is no way to give preference to the first and last pages in key order.

For more information see this related question:

How does sampling work when updating statistics?

and my article, Allocation Order Scans

For workarounds, see Statistics on Ascending Columns by Fabiano Amorim