A custom statistic exists for the CacheId column of a table. After an overnight statistics update:
Statistics for INDEX 'ST_TableName_CacheId'.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name Updated Rows Rows Sampled Steps Density Average Key Length String Index
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ST_TableName_CacheId Apr 26 2014 2:04AM 121482 121482 6 0 4 NO 121482
All Density Average Length Columns
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0.1666667 4 CacheId
Histogram Steps
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
39968 0 20247 0 1
40058 0 20247 0 1
40062 0 20247 0 1
40066 0 20247 0 1
40069 0 20247 0 1
41033 0 20247 0 1
1) Performance of a join against an existing data set in this table where CacheId = 41033 perform well with good estimates (23622 vs actual of 20247).
2) Then an insert is performed with CacheId = 41273 of 20247 rows.
3) Then a join against this newly inserted data set shows a poor estimate of 1 row resulting in a bad plan.
4) A manual update of the statistics (which was originally with fullscan) shows a new histogram:
Statistics for INDEX 'ST_TableName_CacheId'.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name Updated Rows Rows Sampled Steps Density Average Key Length String Index
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ST_TableName_CacheId Apr 28 2014 10:41AM 141729 141729 7 0 4 NO 141729
All Density Average Length Columns
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0.1428571 4 CacheId
Histogram Steps
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
39968 0 20247 0 1
40058 0 20247 0 1
40062 0 20247 0 1
40066 0 20247 0 1
40069 0 20247 0 1
41033 0 20247 0 1
41274 0 20247 0 1
5) Running the same join query again for the CacheId = 41274 shows perfect estimates (20247) and good performance.
Q1) Why mathematically is the original estimate so bad? I mean the CacheId's are sparse but not at a ratio of 20000:1.
Q2) As the number of cacheId's increases would you expect the estimates for newly inserted data improve naturally?
Q3) Are there any ways (gulp, tricks or otherwise) to improve the estimate (or make it less certain of 1 row) without having to update the statistics every time a new set of data is inserted (e.g. adding a fake data set at a much larger CacheId = 999999).
Here are the true number of rows for all CacheId's within the table:
CacheId Rows
39968 20247
40058 20247
40062 20247
40066 20247
40069 20247
41033 20247
41274 20247
[ I don't think QP's are needed to answer this question and its some work to clean them up. I can answer specific questions if needed! ]
Best Answer
Here is the rule to trigger auto update the stats Statistical maintenance functionality (autostats) in SQL Server:
Even thought the KB point to 2000, it's still true up to 2012.
Run through this scenario and see for yourself.
STEP#1
Now we have a table with IDs 1 through 6 and each ID has 20247 rows. Stats look good so far!
STEP#2
Look at the table and histogram! The actual table has ID = 7 with 20247 rows but the histogram has no idea that you've just inserted the new data because the auto update didn't trigger. According the the formula you need to insert (20247 * 6) * 0.2 + 500 = 24,796.4 rows to trigger an auto update for stats on this table.
Thus, if you look at the plans for these queries you see the wrong estimates:
Query #1:
Query #2:
The Optimize cannot say 0 rows, so it just shows you 1.
STEP#3
Now the histogram show the missing ID 7 and the execution plans show the right estimates as well.
Query #1:
Query #2:
Yes, as soon as you pass the threshold of 20% + 500 from the total rows. The auto update will trigger. You can run though this scenario by re-running STEP#1, but then modify STEP#2 by running these queries:
No update yet because the threshold is 24,796.4 - 20247 = 4549.4 but we inserted only 4548 rows for ID 8. Now insert this one row and double check the histogram:
Controlling Autostat (AUTO_UPDATE_STATISTICS) behavior in SQL Server
Hope this helped you to understand! Pretty good question!