SQL Server – Bizarre density results in sampled statistics

sql-server, sql-server-2014, statistics

A nonclustered index gets a completely different statistical distribution when estimated with sampling versus FULLSCAN; the sampled one has a bizarre density vector. This results in poor execution plans.


I have a table of ~27M rows, with a non null FK-column supported by a nonclustered index. The table is clustered on its primary key. Both columns are varchar.

A fullscan statistics update for our FK-column gives a normal looking density vector:

All density Average Length  Columns
6,181983E-08    45,99747    INSTANCEELEMENTID
3,615442E-08    95,26874    INSTANCEELEMENTID, ID

That is, we're expected to read about 1.7 rows for each distinct INSTANCEELEMENTID we are joining with.
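The 1.7 figure follows directly from the density vector. A quick back-of-the-envelope check (Python, with the row count rounded to 27M):

```python
# "All density" is 1 / (number of distinct values), so the fullscan
# vector directly implies a distinct count and a rows-per-value ratio.
total_rows = 27_000_000        # approximate table size
density = 6.181983e-08         # fullscan "All density" for INSTANCEELEMENTID

distinct_values = 1 / density          # ~16.2 million distinct values
rows_per_value = total_rows * density  # ~1.67 rows per distinct value

print(f"{distinct_values:,.0f} distinct values, "
      f"{rows_per_value:.2f} rows per value")
```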

A typical bin from the histogram looks like this:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
FOOBAR          133053      10      71366               1,679318

However, if we do a sampled update (using the default sample size, which is about 230k rows for this table), things take a turn for the bizarre:

4,773657E-06    45,99596    INSTANCEELEMENTID
3,702179E-08    95,30183    INSTANCEELEMENTID, ID

The density on INSTANCEELEMENTID is now two orders of magnitude larger. (The density for the column pair, however, is still estimated at a quite acceptable value.)
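Two orders of magnitude in density translates into a dramatic collapse of the implied distinct count (Python, using the two density values above):

```python
fullscan_density = 6.181983e-08   # implies ~16.2M distinct values
sampled_density = 4.773657e-06    # implies only ~210k distinct values

implied_fullscan = 1 / fullscan_density
implied_sampled = 1 / sampled_density
print(f"fullscan: {implied_fullscan:,.0f} distinct, "
      f"sampled: {implied_sampled:,.0f} distinct "
      f"({sampled_density / fullscan_density:.0f}x denser)")
```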

A typical bin from the histogram now looks like this:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS     DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
FOOBAR          143870,4    766,2573    1247                115,3596
ZOTZOT          131560,7    1           969                 135,7092

which is a completely different distribution. Note that the INSTANCEELEMENTID with the highest number of associated IDs has 12, and the most common number is 1. It's also very strange that some bins get EQ_ROWS = 1; this happens to about 10% of the bins.

There is no "unlucky" draw of strange rows that could contribute to this.

Am I reading the histogram correctly? Doesn't it look like the sampling has somehow scaled EQ_ROWS, DISTINCT_RANGE_ROWS and AVG_RANGE_ROWS incorrectly?

The table is, as far as I can tell, unskewed. I've tried to emulate the sampler by drawing samples myself with TABLESAMPLE. Counting those in the normal way gives results that agree with the fullscan version, not with the sampler.

Furthermore, I've not been able to reproduce this behaviour on clustered indices.


I've narrowed this down to the following reproduction:

CREATE TABLE F_VAL (
    id varchar(100) primary key,
    num_l_val int not null
)

set nocount on

declare @rowlimit integer = 20000000;

-- The table must be sufficiently large for this to be observed.
-- I've seen this with uniqueidentifier and varchar(100) but not int.

declare @i integer = 1;

declare @r float = rand()

while @i < @rowlimit
begin
set @r = rand()
insert f_val (id,num_l_val)
values (
   cast(@i as varchar(100)) + REPLICATE('f', 40 - len(@i)),
   case when @r > 0.8 then 4 when @r > 0.5 then 3 when @r > 0.4 then 2 else 1 end
)
  set @i = @i + 1

end

create table k_val (
 id int identity primary key,
 f_val varchar(100) not null
)

insert into k_val(f_val)
select id from F_VAL
union all select id from f_val where num_l_val - 1 = 1
union all select id from f_val where num_l_val - 2 = 1
union all select id from f_val where num_l_val - 3 = 1
order by id

create nonclustered index IX_K_VAL_F_VAL  ON K_VAL (F_VAL)

update statistics K_VAL(IX_K_VAL_F_VAL) 
dbcc show_statistics (k_val,IX_k_VAL_F_VAL)

update statistics K_VAL(IX_K_VAL_F_VAL) WITH FULLSCAN
dbcc show_statistics (k_val,IX_k_VAL_F_VAL)

Compare the two statistics: the sampled one now shows a totally different density vector, and the histogram bins are off. Note that the table is not skewed.

Using int as the datatype does not cause this. Does SQL Server not examine the entire datapoint when using varchar?

It's worth mentioning that the problem seems to scale with table size; increasing the sample rate helps.

Best Answer

I've seen this same density problem on some of the nonclustered indexes in the largest databases I have access to. I'll start with a few observations I've made about histograms and density calculations:

  • SQL Server is able to use the primary key on the table to infer something about the density of both columns. This means that the density that includes the PK columns will usually be very accurate.
  • The density calculation for the first column in the statistics is consistent with the histogram. If the histogram does not model the data well then the density may be off.
  • To create the histogram the StatMan function makes inferences about the data that is missing. The behavior can change depending on the data type of the column.

For one way to look at the problem, suppose that you sample 100 rows from a 10000-row table and get 100 distinct values. One guess is that the table contains 10000 unique values. Another guess is that there are exactly 100 distinct values, each repeated 100 times. The second guess may seem unreasonable, and I'd agree. However, how do you balance the two approaches when the sampled data comes back unevenly distributed? Microsoft has developed a set of algorithms for this, contained in the StatMan function. The algorithms may not work well for all data distributions and all sample levels.
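To make the two extremes concrete, here is a small sketch. This is not SQL Server's actual algorithm, just the two naive guesses described above:

```python
def naive_distinct_estimates(total_rows, sample_rows, sample_distinct):
    """Two extreme guesses for the number of distinct values in a table."""
    # Guess 1: the uniqueness rate seen in the sample holds for the
    # whole table, so scale the distinct count up proportionally.
    scale_up = sample_distinct * total_rows / sample_rows
    # Guess 2: the sample already saw every distinct value in the table.
    saw_them_all = sample_distinct
    return scale_up, saw_them_all

# 100 rows sampled from a 10000-row table, all 100 values distinct:
print(naive_distinct_estimates(10_000, 100, 100))  # (10000.0, 100)
```

Real estimators have to land somewhere between these extremes based on how many repeats the sample contains.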

Let's go through a relatively simple example. I'm going to use VARCHAR columns like in your table to see some of the same behavior. However, I'll just add one skewed value to the table. I'm testing against SQL Server 2016 SP1. Start with 100k rows with 100k unique values for the FK column:

DROP TABLE IF EXISTS X_STATS_SMALL;

CREATE TABLE X_STATS_SMALL (
ID VARCHAR(10) NOT NULL, 
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID)
);
-- insert 100k rows
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.GetNums(100000);

CREATE INDEX IX_X_STATS_SMALL ON X_STATS_SMALL (FK);

-- get sampled stats
UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;

Here are some samples from the statistics:

╔═════════════╦════════════════╦═════════╗
║ All density ║ Average Length ║ Columns ║
╠═════════════╬════════════════╬═════════╣
║ 1.00001E-05 ║ 4.888205       ║ FK      ║
║ 1.00001E-05 ║ 9.77641        ║ FK, ID  ║
╚═════════════╩════════════════╩═════════╝

╔══════════════╦════════════╦═════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬═════════╬═════════════════════╬════════════════╣
║ 1005         ║ 0          ║ 1       ║ 0                   ║ 1              ║
║ 10648        ║ 665.0898   ║ 1       ║ 664                 ║ 1.002173       ║
║ 10968        ║ 431.6008   ║ 1       ║ 432                 ║ 1              ║
║ 11182        ║ 290.0924   ║ 1       ║ 290                 ║ 1              ║
║ 1207         ║ 445.7517   ║ 1       ║ 446                 ║ 1              ║
║ ...          ║ ...        ║ ...     ║ ...                 ║ ...            ║
║ 99989        ║ 318.3941   ║ 1       ║ 318                 ║ 1              ║
╚══════════════╩════════════╩═════════╩═════════════════════╩════════════════╝

For evenly distributed data with one unique value per row we get an accurate density, even with a VARCHAR histogram column and a sample size of 14294 rows.

Now let's add a skewed value and update statistics again:

-- add 70k rows with a FK value of '35000'
INSERT INTO X_STATS_SMALL WITH (TABLOCK)
SELECT N + 100000 , '35000',  REPLICATE('Z', 900)
FROM dbo.GetNums(70000);

UPDATE STATISTICS X_STATS_SMALL IX_X_STATS_SMALL;

With a sample size of 17010 rows the density of the first column is higher than it should be, implying far fewer distinct values than actually exist:

╔══════════════╦════════════════╦═════════╗
║ All density  ║ Average Length ║ Columns ║
╠══════════════╬════════════════╬═════════╣
║ 6.811061E-05 ║ 4.935802       ║ FK      ║
║ 5.882353E-06 ║ 10.28007       ║ FK, ID  ║
╚══════════════╩════════════════╩═════════╝

╔══════════════╦════════════╦══════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS  ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬══════════╬═════════════════════╬════════════════╣
║ 10039        ║ 0          ║ 1        ║ 0                   ║ 1              ║
║ 10978        ║ 956.9945   ║ 1        ║ 138                 ║ 6.954391       ║
║ 11472        ║ 621.0283   ║ 1        ║ 89                  ║ 6.941863       ║
║ 1179         ║ 315.6046   ║ 1        ║ 46                  ║ 6.907561       ║
║ 11909        ║ 91.62713   ║ 1        ║ 14                  ║ 6.74198        ║
║ ...          ║ ...        ║ ...      ║ ...                 ║ ...            ║
║ 35000        ║ 376.6893   ║ 69195.05 ║ 54                  ║ 6.918834       ║
║ ...          ║ ...        ║ ...      ║ ...                 ║ ...            ║
║ 99966        ║ 325.7854   ║ 1        ║ 47                  ║ 6.909731       ║
╚══════════════╩════════════╩══════════╩═════════════════════╩════════════════╝

It's surprising that the AVG_RANGE_ROWS is pretty uniform for all steps at around 6.9, even for buckets of keys for which the sample couldn't have found duplicate values. I don't know why this is. The most likely explanation is that the algorithm used to guess at the missing pages does not do well with this data distribution and sample size.

As stated earlier, it is possible to calculate the density for the FK column using the histogram. The sum of the DISTINCT_RANGE_ROWS values for all steps is 14497. There are 179 histogram steps so the density should be about 1 / (179 + 14497) = 0.00006813845 which is pretty close to the reported value.
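That back-calculation is easy to verify (Python, with the figures quoted above):

```python
histogram_steps = 179            # each RANGE_HI_KEY is one distinct value
distinct_range_rows = 14_497     # sum of DISTINCT_RANGE_ROWS over all steps
reported_density = 6.811061e-05  # sampled "All density" for FK

derived_density = 1 / (histogram_steps + distinct_range_rows)
print(derived_density)  # ~6.8138e-05, close to the reported value
```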

Testing with a larger table shows how the problem worsens as the table grows. This time we'll start with 1M rows:

DROP TABLE IF EXISTS X_STATS_LARGE;

CREATE TABLE X_STATS_LARGE (
ID VARCHAR(10) NOT NULL,
FK VARCHAR(10) NOT NULL,
PADDING VARCHAR(900) NOT NULL,
PRIMARY KEY (ID));

INSERT INTO X_STATS_LARGE WITH (TABLOCK)
SELECT N, N, REPLICATE('Z', 900)
FROM dbo.Getnums(1000000);

CREATE INDEX IX_X_STATS_LARGE ON X_STATS_LARGE (FK);

-- get sampled stats
UPDATE STATISTICS X_STATS_LARGE IX_X_STATS_LARGE;

The statistics object isn't interesting yet. Density for FK is 1.025289E-06 which is close to exact (1.0E-06).

Now let's add a skewed value and update statistics again:

INSERT INTO X_STATS_LARGE WITH (TABLOCK)
SELECT N + 1000000 , '350000',  REPLICATE('Z', 900)
FROM dbo.Getnums(700000);

UPDATE STATISTICS X_STATS_LARGE IX_X_STATS_LARGE;

With a sample size of 45627 rows the density of the first column is worse than before:

╔══════════════╦════════════════╦═════════╗
║ All density  ║ Average Length ║ Columns ║
╠══════════════╬════════════════╬═════════╣
║ 2.60051E-05  ║ 5.93563        ║ FK      ║
║ 5.932542E-07 ║ 12.28485       ║ FK, ID  ║
╚══════════════╩════════════════╩═════════╝

╔══════════════╦════════════╦═════════╦═════════════════════╦════════════════╗
║ RANGE_HI_KEY ║ RANGE_ROWS ║ EQ_ROWS ║ DISTINCT_RANGE_ROWS ║ AVG_RANGE_ROWS ║
╠══════════════╬════════════╬═════════╬═════════════════════╬════════════════╣
║ 100023       ║ 0          ║ 1       ║ 0                   ║ 1              ║
║ 107142       ║ 8008.354   ║ 1       ║ 306                 ║ 26.17787       ║
║ 110529       ║ 4361.357   ║ 1       ║ 168                 ║ 26.02392       ║
║ 114558       ║ 3722.193   ║ 1       ║ 143                 ║ 26.01217       ║
║ 116696       ║ 2556.658   ║ 1       ║ 98                  ║ 25.97568       ║
║ ...          ║ ...        ║ ...     ║ ...                 ║ ...            ║
║ 350000       ║ 5000.522   ║ 700435  ║ 192                 ║ 26.03268       ║
║ ...          ║ ...        ║ ...     ║ ...                 ║ ...            ║
║ 999956       ║ 2406.266   ║ 1       ║ 93                  ║ 25.96841       ║
╚══════════════╩════════════╩═════════╩═════════════════════╩════════════════╝

AVG_RANGE_ROWS is up to 26. Interestingly, if I change the sample size to 170100 rows (10X the other table) then the average value for AVG_RANGE_ROWS is again right around 6.9. As your table gets bigger SQL Server will pick a smaller sample size which means that it needs to make guesses about a larger percentage of pages in the table. This can exaggerate statistics problems for certain kinds of data skew.
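The shrinking sample fraction is easy to see from the two tests. The numbers below are the sample sizes quoted above, not a general formula for SQL Server's default sampling rate:

```python
# Default sample sizes SQL Server chose in the two tests above.
small_fraction = 17_010 / 170_000      # ~10% of the 170k-row table
large_fraction = 45_627 / 1_700_000    # ~2.7% of the 1.7M-row table

print(f"small table: {small_fraction:.1%}, large table: {large_fraction:.1%}")
```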

In conclusion, it's important to remember that SQL Server does not calculate the density like this:

SELECT COUNT(DISTINCT FK) * 1700000. / COUNT(*) -- 1071198.9 distinct values for one run
FROM X_STATS_LARGE TABLESAMPLE (45627 ROWS);

For some data distributions that would be very accurate. Instead, it uses undocumented algorithms. In your question you said that your data wasn't skewed, but the INSTANCEELEMENTID value with the highest number of associated IDs has 12 while the most common number is 1. For the purposes of the algorithms used by StatMan, that could count as skewed.
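For comparison, the naive TABLESAMPLE-based estimate above would have produced a far better density here, since the large table has 1M distinct FK values by construction:

```python
naive_distinct = 1_071_198.9     # one run of the TABLESAMPLE query above
reported_density = 2.60051e-05   # sampled "All density" for FK
true_density = 1 / 1_000_000     # 1M distinct FK values by construction

naive_density = 1 / naive_distinct  # ~9.3e-07, within ~7% of the truth
print(naive_density, reported_density / true_density)  # StatMan is ~26x off
```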

At that point there's nothing you can do about it except gather stats with a higher sample rate. One common strategy is to gather statistics with FULLSCAN and NORECOMPUTE, and to refresh the stats with a job on whatever interval makes sense for your rate of data change. In my experience a FULLSCAN update isn't as bad as most people think, especially against an index: SQL Server can scan just that index instead of the entire table (as it would have to for a rowstore table against an unindexed column). Additionally, in SQL Server 2014 only FULLSCAN stats updates are done in parallel, so a FULLSCAN update can finish faster than some sampled updates.