An NC index gets a totally different statistical distribution when estimated with sampling versus with a full scan, with the sampled one having a bizarre density vector. This results in poor execution plans.
I have a table of ~27M rows with a non-null FK column supported by a nonclustered index. The table is clustered on its primary key. Both columns are varchar.
A fullscan statistics update for our FK column gives a normal-looking density vector:

```
All density   Average Length  Columns
6,181983E-08  45,99747        INSTANCEELEMENTID
3,615442E-08  95,26874        INSTANCEELEMENTID, ID
```
That is, we're expected to read about 1.7 rows for each distinct `INSTANCEELEMENTID` we join with (1 / 6.18E-08 ≈ 16.2M distinct values; 27M rows / 16.2M ≈ 1.7).
A typical bin from the histogram looks like this:

```
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
FOOBAR        133053      10       71366                1,679318
```
However, if we do a sampled update (using the default sample size, which is 230k rows for this table), things take a turn for the bizarre:

```
All density   Average Length  Columns
4,773657E-06  45,99596        INSTANCEELEMENTID
3,702179E-08  95,30183        INSTANCEELEMENTID, ID
```
The density on `INSTANCEELEMENTID` is now two orders of magnitude larger, implying roughly 1 / 4.77E-06 ≈ 210k distinct values, or about 129 rows per value. (The density for both columns together, however, has been estimated to a quite acceptable value.)
A typical bin from the histogram now looks like this:

```
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS   DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
FOOBAR        143870,4    766,2573  1247                 115,3596
ZOTZOT        131560,7    1         969                  135,7092
```
which is a completely different distribution. Note that the `INSTANCEELEMENTID` with the highest number of associated `ID`s has 12, and the most common number is 1. It's also very strange that some bins get `EQ_ROWS = 1`; this happens to about 10% of the bins.
There is no "unlucky" draw of strange rows that could contribute to this.
Am I reading the histogram correctly? Doesn't it look like the sampling has somehow scaled EQ_ROWS, DISTINCT_RANGE_ROWS and AVG_RANGE_ROWS incorrectly?
The table is, as far as I can tell, unskewed. I've tried to emulate the sampler by estimating values myself with `TABLESAMPLE`. Counting those results in the normal way gives numbers that agree with the fullscan version, not with the sampler.
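Roughly, the emulation looked like this (a sketch; `MyTable` is a placeholder for the real ~27M-row table, and the 230000-row sample mirrors the default sample size):

```sql
-- Sketch of emulating the sampler with TABLESAMPLE:
-- count distinct values in a sample of about the default size
-- and invert to get a density estimate.
SELECT 1.0 / COUNT(DISTINCT INSTANCEELEMENTID) AS estimated_density
FROM MyTable TABLESAMPLE (230000 ROWS);
```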
Furthermore, I've not been able to reproduce this behaviour on clustered indices.
I've narrowed this down to the following reproduction:
```sql
CREATE TABLE F_VAL (
    id varchar(100) primary key,
    num_l_val int not null
)

set nocount on
declare @rowlimit integer = 20000000;
```

The table must be sufficiently large for this to be observed. I've seen it with `uniqueidentifier` and `varchar(100)`, but not with `int`.

```sql
declare @i integer = 1;
declare @r float = rand()

while @i < @rowlimit
begin
    set @r = rand()
    insert f_val (id, num_l_val)
    values (
        cast(@i as varchar(100)) + REPLICATE('f', 40 - len(@i)),
        case when @r > 0.8 then 4 when @r > 0.5 then 3 when @r > 0.4 then 2 else 1 end
    )
    set @i = @i + 1
end
```
```sql
create table k_val (
    id int identity primary key,
    f_val varchar(100) not null
)

insert into k_val (f_val)
select id from f_val
union all select id from f_val where num_l_val - 1 = 1
union all select id from f_val where num_l_val - 2 = 1
union all select id from f_val where num_l_val - 3 = 1
order by id

create nonclustered index IX_K_VAL_F_VAL on k_val (f_val)

update statistics k_val (IX_K_VAL_F_VAL)
dbcc show_statistics (k_val, IX_K_VAL_F_VAL)

update statistics k_val (IX_K_VAL_F_VAL) with fullscan
dbcc show_statistics (k_val, IX_K_VAL_F_VAL)
```
Compare the two sets of statistics: the sampled one now represents a totally different density vector, and the histogram bins are off. Note that the table is not skewed.
Using `int` as the datatype does not cause this. Does SQL Server not examine the entire datapoint when using `varchar`?
It's worth mentioning that the problem seems to scale with table size; increasing the sample rate helps.
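For instance, a manually specified sample rate can be forced like this (the percentage here is an arbitrary illustration, not a recommendation):

```sql
-- Force a larger sample than the default for this statistics object
UPDATE STATISTICS K_VAL (IX_K_VAL_F_VAL) WITH SAMPLE 25 PERCENT;
```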
Best Answer
I've seen this same density problem on some of the nonclustered indexes in the largest databases that I have access to. First, an observation about histograms and density calculations: for sampled statistics, the `StatMan` function makes inferences about the data that was not sampled, and its behavior can change depending on the data type of the column.

For one way to look at the problem, suppose that you sample 100 rows from a 10000-row table and get back 100 distinct values. One guess is that the rest of the table is also unique, giving 10000 distinct values. Another guess is that there are only 100 distinct values, each repeated 100 times. The second guess may seem unreasonable, and I'll agree. The hard part is how to balance the two approaches when the sampled data comes back unevenly distributed. Microsoft has developed a set of algorithms for this, contained in the `StatMan` function. The algorithms may not work for all data distributions and all sample levels.

Let's go through a relatively simple example. I'm going to use `VARCHAR` columns, as in your table, to see some of the same behavior, but I'll add just one skewed value to the table. I'm testing against SQL Server 2016 SP1. Start with 100k rows that have 100k unique values for the `FK` column. Sampling statistics over this data: for evenly distributed data with one unique value per row, we get an accurate density, even with a `VARCHAR` histogram column and a sample size of 14294 rows.

Now let's add a skewed value and update statistics again:
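A sketch of what adding the skewed value could look like (the table name `X`, column `FK`, index name, and the repeated-row count are assumptions for illustration, not the answer's exact script):

```sql
-- Sketch: append one heavily repeated value to an otherwise unique column,
-- then run a default (sampled) statistics update over the index.
INSERT INTO X (FK)
SELECT TOP (35000) 'SKEWED_VALUE'
FROM master..spt_values a
CROSS JOIN master..spt_values b;  -- cross join just to generate enough rows

UPDATE STATISTICS X (IX_X_FK);    -- default sampling
DBCC SHOW_STATISTICS (X, IX_X_FK);
```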
With a sample size of 17010 rows, the density of the first column is larger than it should be.
It's surprising that `AVG_RANGE_ROWS` is pretty uniform for all steps, at around 6.9, even for buckets of keys for which the sample couldn't have found duplicate values. I don't know why this is. The most likely explanation is that the algorithm used to guess at the missing pages does not do well with this data distribution and sample size.

As stated earlier, it is possible to calculate the density for the `FK` column using the histogram. The sum of the `DISTINCT_RANGE_ROWS` values over all steps is 14497. There are 179 histogram steps, so the density should be about 1 / (179 + 14497) = 0.00006813845, which is pretty close to the reported value.

Testing with a larger table shows how the problem gets worse as the table gets bigger. This time we'll start with 1M rows:
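A sketch of the kind of setup this implies (the names, the `VARCHAR` length, and the row-generation trick are assumptions):

```sql
-- Sketch: 1M rows, each with a unique VARCHAR FK value
CREATE TABLE X_1M (
    ID BIGINT IDENTITY PRIMARY KEY,
    FK VARCHAR(10) NOT NULL
);

INSERT INTO X_1M (FK)
SELECT TOP (1000000)
    CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VARCHAR(10))
FROM master..spt_values a
CROSS JOIN master..spt_values b;  -- generates enough rows to take TOP 1M from

CREATE INDEX IX_X_1M_FK ON X_1M (FK);
UPDATE STATISTICS X_1M (IX_X_1M_FK);  -- default sampled update
```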
The statistics object isn't interesting yet. The density for `FK` is 1.025289E-06, which is close to exact (1.0E-06).

Now let's add a skewed value and update statistics again:
With a sample size of 45627 rows, the density of the first column is worse than before: `AVG_RANGE_ROWS` is up to 26. Interestingly, if I change the sample size to 170100 rows (10X the other table's), the average value for `AVG_RANGE_ROWS` is again right around 6.9. As your table gets bigger, SQL Server picks a smaller sample size, which means it has to make guesses about a larger percentage of the pages in the table. This can exaggerate statistics problems for certain kinds of data skew.

In conclusion, it's important to remember that SQL Server does not calculate the density like this:
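Presumably something along the lines of the naive distinct-count calculation (a sketch; `X` and `FK` are placeholder names):

```sql
-- The "obvious" density formula, which SQL Server does NOT use
-- when building sampled statistics:
SELECT 1.0 / COUNT(DISTINCT FK) AS density
FROM X;
```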
which for some data distributions would be very accurate. Instead it uses undocumented algorithms. In your question you said that your data wasn't skewed, but the `INSTANCEELEMENTID` value with the highest number of associated `ID`s has 12, while the most common number is 1. For the purposes of the algorithms used by `StatMan`, that could count as skewed.

At that point there's nothing you can do about it except gather stats with a higher sample rate. One common strategy is to gather statistics with `FULLSCAN` and `NORECOMPUTE`, and to refresh the stats with a job on whatever interval makes sense for your rate of data change. In my experience a `FULLSCAN` update isn't as bad as most people think, especially against an index: SQL Server can scan the entire index instead of the entire table (as it would for a rowstore table against an unindexed column). Additionally, in SQL Server 2014 only `FULLSCAN` stats updates are done in parallel, so a `FULLSCAN` update can finish faster than some sampled updates.
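That strategy can be sketched with the repro's object names (run it from a scheduled job at whatever interval fits your data-change rate):

```sql
-- Gather exact statistics and prevent automatic recomputation
-- from overwriting them with sampled ones.
UPDATE STATISTICS K_VAL (IX_K_VAL_F_VAL) WITH FULLSCAN, NORECOMPUTE;
```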