SQL Server Statistics – How Are the Number of Histogram Steps Decided?

Tags: sql-server, statistics

How are the number of histogram steps decided in Statistics in SQL Server?

Why is it restricted to 200 steps even though my key column has more than 200 distinct values? Is there any deciding factor?


Demo

Schema definition

CREATE TABLE histogram_step
  (
     id   INT IDENTITY(1, 1),
     name VARCHAR(50),
     CONSTRAINT pk_histogram_step PRIMARY KEY (id)
  )

Inserting 100 records into my table

INSERT INTO histogram_step
            (name)
SELECT TOP 100 name
FROM   sys.syscolumns

Updating and checking the statistics

UPDATE STATISTICS histogram_step WITH fullscan

DBCC show_statistics('histogram_step', pk_histogram_step)

Histogram steps:

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
|            1 |          0 |       1 |                   0 |              1 |
|            3 |          1 |       1 |                   1 |              1 |
|            5 |          1 |       1 |                   1 |              1 |
|            7 |          1 |       1 |                   1 |              1 |
|            9 |          1 |       1 |                   1 |              1 |
|           11 |          1 |       1 |                   1 |              1 |
|           13 |          1 |       1 |                   1 |              1 |
|           15 |          1 |       1 |                   1 |              1 |
|           17 |          1 |       1 |                   1 |              1 |
|           19 |          1 |       1 |                   1 |              1 |
|           21 |          1 |       1 |                   1 |              1 |
|           23 |          1 |       1 |                   1 |              1 |
|           25 |          1 |       1 |                   1 |              1 |
|           27 |          1 |       1 |                   1 |              1 |
|           29 |          1 |       1 |                   1 |              1 |
|           31 |          1 |       1 |                   1 |              1 |
|           33 |          1 |       1 |                   1 |              1 |
|           35 |          1 |       1 |                   1 |              1 |
|           37 |          1 |       1 |                   1 |              1 |
|           39 |          1 |       1 |                   1 |              1 |
|           41 |          1 |       1 |                   1 |              1 |
|           43 |          1 |       1 |                   1 |              1 |
|           45 |          1 |       1 |                   1 |              1 |
|           47 |          1 |       1 |                   1 |              1 |
|           49 |          1 |       1 |                   1 |              1 |
|           51 |          1 |       1 |                   1 |              1 |
|           53 |          1 |       1 |                   1 |              1 |
|           55 |          1 |       1 |                   1 |              1 |
|           57 |          1 |       1 |                   1 |              1 |
|           59 |          1 |       1 |                   1 |              1 |
|           61 |          1 |       1 |                   1 |              1 |
|           63 |          1 |       1 |                   1 |              1 |
|           65 |          1 |       1 |                   1 |              1 |
|           67 |          1 |       1 |                   1 |              1 |
|           69 |          1 |       1 |                   1 |              1 |
|           71 |          1 |       1 |                   1 |              1 |
|           73 |          1 |       1 |                   1 |              1 |
|           75 |          1 |       1 |                   1 |              1 |
|           77 |          1 |       1 |                   1 |              1 |
|           79 |          1 |       1 |                   1 |              1 |
|           81 |          1 |       1 |                   1 |              1 |
|           83 |          1 |       1 |                   1 |              1 |
|           85 |          1 |       1 |                   1 |              1 |
|           87 |          1 |       1 |                   1 |              1 |
|           89 |          1 |       1 |                   1 |              1 |
|           91 |          1 |       1 |                   1 |              1 |
|           93 |          1 |       1 |                   1 |              1 |
|           95 |          1 |       1 |                   1 |              1 |
|           97 |          1 |       1 |                   1 |              1 |
|           99 |          1 |       1 |                   1 |              1 |
|          100 |          0 |       1 |                   0 |              1 |
+--------------+------------+---------+---------------------+----------------+

As we can see there are 53 steps in the histogram.
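If you just want the step count without reading the whole DBCC output, sys.dm_db_stats_properties (available from SQL Server 2008 R2 SP2 / 2012 SP1 onwards) exposes it directly. A quick sketch:

-- Rough sketch: read the row count, rows sampled, and step count for this statistics object.
SELECT s.name, sp.rows, sp.rows_sampled, sp.steps
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.histogram_step')
      AND s.name = 'pk_histogram_step';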

Inserting another 10,000 records

INSERT INTO histogram_step
            (name)
SELECT TOP 10000 b.name
FROM   sys.syscolumns a
       CROSS JOIN sys.syscolumns b

Updating and checking the statistics

UPDATE STATISTICS histogram_step WITH fullscan

DBCC show_statistics('histogram_step', pk_histogram_step)

Now the histogram is reduced to 4 steps:

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
|            1 |          0 |       1 |                   0 |              1 |
|        10088 |      10086 |       1 |               10086 |              1 |
|        10099 |         10 |       1 |                  10 |              1 |
|        10100 |          0 |       1 |                   0 |              1 |
+--------------+------------+---------+---------------------+----------------+

Inserting another 100,000 records

INSERT INTO histogram_step
            (name)
SELECT TOP 100000 b.name
FROM   sys.syscolumns a
       CROSS JOIN sys.syscolumns b

Updating and checking the statistics

UPDATE STATISTICS histogram_step WITH fullscan

DBCC show_statistics('histogram_step', pk_histogram_step) 

Now the histogram is reduced to 3 steps:

+--------------+------------+---------+---------------------+----------------+
| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
+--------------+------------+---------+---------------------+----------------+
|            1 |          0 |       1 |                   0 |              1 |
|       110099 |     110097 |       1 |              110097 |              1 |
|       110100 |          0 |       1 |                   0 |              1 |
+--------------+------------+---------+---------------------+----------------+

Can someone tell me how these steps are decided?

Best Answer

I'm going to limit this post to discussing single-column statistics, both because it'll already be pretty lengthy and because you're interested in how SQL Server buckets the data into histogram steps. For multi-column statistics the histogram is only created on the leading column.

When SQL Server determines that a statistics update is needed it kicks off a hidden query that reads either all of a table's data or a sample of the table's data. You can view these queries with extended events. There is a function called StatMan within SQL Server that is involved with creating the histograms. For simple statistics objects there are at least two different types of StatMan queries (there are different queries for quick stat updates and I suspect that the incremental stats feature on partitioned tables also uses a different query).
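If you want to see those hidden queries yourself, an Extended Events session along these lines works as a rough sketch (the session name and event choice are mine and may need adjusting): capture completed statements while you run UPDATE STATISTICS and look for StatMan in the collected text.

CREATE EVENT SESSION capture_statman ON SERVER
ADD EVENT sqlserver.sql_statement_completed
    (ACTION (sqlserver.sql_text))
ADD TARGET package0.ring_buffer;
GO

ALTER EVENT SESSION capture_statman ON SERVER STATE = START;
-- ... run UPDATE STATISTICS here, then inspect the ring_buffer target
-- (or watch live data in SSMS) for statements containing StatMan ...
ALTER EVENT SESSION capture_statman ON SERVER STATE = STOP;

DROP EVENT SESSION capture_statman ON SERVER;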

The first one just grabs all of the data from the table without any filtering. You can see this when the table is very small or you gather stats with the FULLSCAN option:

CREATE TABLE X_SHOW_ME_STATMAN (N INT);
CREATE STATISTICS X_STAT_X_SHOW_ME_STATMAN ON X_SHOW_ME_STATMAN (N);

-- after gathering stats with 1 row in table
SELECT StatMan([SC0]) FROM
(
    SELECT TOP 100 PERCENT [N] AS [SC0] 
    FROM [dbo].[X_SHOW_ME_STATMAN] WITH (READUNCOMMITTED)
    ORDER BY [SC0] 
) AS _MS_UPDSTATS_TBL 
OPTION (MAXDOP 16);

SQL Server picks the automatic sample size based on the size of the table (I think that it's both number of rows and pages in the table). If a table is too big then the automatic sample size falls below 100%. Here's what I got for the same table with 1M rows:

-- after gathering stats with 1 M rows in table
SELECT StatMan([SC0], [SB0000]) FROM 
(
    SELECT TOP 100 PERCENT [SC0], step_direction([SC0]) over (order by NULL) AS [SB0000] 
    FROM 
    (
        SELECT [N] AS [SC0] 
        FROM [dbo].[X_SHOW_ME_STATMAN] TABLESAMPLE SYSTEM (6.666667e+001 PERCENT) WITH (READUNCOMMITTED) 
    ) AS _MS_UPDSTATS_TBL_HELPER 
    ORDER BY [SC0], [SB0000] 
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1);

TABLESAMPLE is documented but StatMan and step_direction are not. Here SQL Server samples around 66.6% of the data from the table to create the histogram. What this means is that you could get a different number of histogram steps when updating stats (without FULLSCAN) on the same data. I've never observed this in practice but I don't see why it wouldn't be possible.

Let's run a few tests on simple data to see how the stats change over time. Below is some test code that I wrote to insert sequential integers into a table, gather stats after each insert, and save information about the stats into a results table. Let's start with just inserting 1 row at a time up to 10000. Test bed:

DECLARE
@stats_id INT,
@table_object_id INT,
@rows_per_loop INT = 1,
@num_of_loops INT = 10000,
@loop_num INT;

BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE X_STATS_RESULTS;

    SET @table_object_id = OBJECT_ID ('X_SEQ_NUM');
    SELECT @stats_id = stats_id FROM sys.stats
    WHERE OBJECT_ID = @table_object_id
    AND name = 'X_STATS_SEQ_INT_FULL';

    SET @loop_num = 0;
    WHILE @loop_num < @num_of_loops
    BEGIN
        SET @loop_num = @loop_num + 1;

        INSERT INTO X_SEQ_NUM WITH (TABLOCK)
        SELECT @rows_per_loop * (@loop_num - 1) + N FROM dbo.GetNums(@rows_per_loop);

        UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN; -- can comment out FULLSCAN as needed

        INSERT INTO X_STATS_RESULTS WITH (TABLOCK)
        SELECT 'X_STATS_SEQ_INT_FULL', @rows_per_loop * @loop_num, rows_sampled, steps 
        FROM sys.dm_db_stats_properties(@table_object_id, @stats_id);
    END;
END;
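The test bed references a few objects that aren't defined in this post. Here's one possible setup, purely as a sketch: the column names in X_STATS_RESULTS are my own guesses, and dbo.GetNums stands in for any numbers function that returns integers 1 through the requested count as a column named N.

-- One possible setup (sketch) for the test bed above.
CREATE TABLE X_SEQ_NUM (X_NUM INT NOT NULL);
CREATE STATISTICS X_STATS_SEQ_INT_FULL ON X_SEQ_NUM (X_NUM);

CREATE TABLE X_STATS_RESULTS
(
    stats_name    SYSNAME,
    rows_in_table BIGINT,
    rows_sampled  BIGINT,
    steps         INT
);
GO

-- Minimal stand-in for dbo.GetNums: returns integers 1 through @n as column N.
CREATE FUNCTION dbo.GetNums (@n BIGINT)
RETURNS TABLE
AS
RETURN
    SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b;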

For this data the number of histogram steps quickly increases to 200 (it first hits the max number of steps with 397 rows), stays at 199 or 200 until 1485 rows are in the table, then slowly decreases until the histogram only has 3 or 4 steps. Here's a graph of all of the data:

(Graph 1: number of rows in the table vs. number of histogram steps.)

Here's what the histogram looks like for 10k rows:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           1       0                   1
9999            9997        1       9997                1
10000           0           1       0                   1

Is it a problem that the histogram only has 3 steps? It looks like information is preserved from our point of view. Note that because the datatype is an INTEGER we can figure out how many rows are in the table for each integer from 1 - 10000. Typically SQL Server can figure this out too, although there are some cases in which this doesn't quite work out. See this SE post for an example of this.
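For example, with the 10k-row table above, a point lookup in the middle of that single large step should still be estimated at about 1 row, because AVG_RANGE_ROWS for the step is 1:

-- Estimated number of rows for this predicate should be ~1, taken from the
-- AVG_RANGE_ROWS of the step with RANGE_HI_KEY = 9999.
SELECT *
FROM X_SEQ_NUM
WHERE X_NUM = 5000;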

What do you think will happen if we delete a single row from the table and update stats? Ideally we'd get another histogram step to show that the missing integer is no longer in the table.

DELETE FROM X_SEQ_NUM
WHERE X_NUM  = 1000;

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 3 steps

DELETE FROM X_SEQ_NUM
WHERE X_NUM  IN (2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 3 steps

That's a little disappointing. If we were building a histogram by hand we would add a step for each missing value. SQL Server is using a general-purpose algorithm, so for some data sets we may be able to come up with a more suitable histogram than the code that it uses. Of course, the practical difference between getting 0 or 1 row from a table is very small. I get the same results when testing with 20000 rows, with each integer having 2 rows in the table. The histogram does not gain steps as I delete data.

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           2       0                   1
9999            19994       2       9997                2
10000           0           2       0                   1
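For reference, that 20,000-row data set can be built with something like this (a sketch that reuses the hypothetical dbo.GetNums helper from earlier):

TRUNCATE TABLE X_SEQ_NUM;

-- Two copies of 1-10000 so that each integer has exactly 2 rows.
INSERT INTO X_SEQ_NUM WITH (TABLOCK)
SELECT N FROM dbo.GetNums(10000)
UNION ALL
SELECT N FROM dbo.GetNums(10000);

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;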

If I test with 1 million rows with each integer having 100 rows in the table I get slightly better results, but I can still construct a better histogram by hand.

truncate table X_SEQ_NUM;
GO

BEGIN TRANSACTION;
GO

INSERT INTO X_SEQ_NUM WITH (TABLOCK)
SELECT N FROM dbo.GetNums(10000);
GO 100

COMMIT TRANSACTION;

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- 4 steps

DELETE FROM X_SEQ_NUM
WHERE X_NUM  = 1000;

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- now 5 steps with a RANGE_HI_KEY of 998 (?)

DELETE FROM X_SEQ_NUM
WHERE X_NUM  IN (2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000);

UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL WITH FULLSCAN;

DBCC SHOW_STATISTICS ('X_SEQ_NUM', 'X_STATS_SEQ_INT_FULL'); -- still 5 steps

Final histogram:

RANGE_HI_KEY    RANGE_ROWS  EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
1               0           100     0                   1
998             99600       100     996                 100
3983            298100      100     2981                100
9999            600900      100     6009                100
10000           0           100     0                   1

Let's test further with sequential integers but with more rows in the table. Note that for tables that are too small, manually specifying a sample size has no effect, so I will add 100 rows per insert and gather stats each time, up to 1 million rows. I see a similar pattern as before, except that once I reach 637,300 rows in the table I no longer sample 100% of the rows with the default sample rate. As I gain rows the number of histogram steps increases. Perhaps this is because SQL Server ends up with more gaps in the data as the number of unsampled rows in the table increases. I do not hit 200 steps even at 1 M rows, but if I kept adding rows I expect that I would get there and eventually start going back down.
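This run is just the earlier test bed with different parameters and with FULLSCAN removed so that the default sample rate is used; roughly:

-- Changed parameters for this run (sketch):
--   @rows_per_loop INT = 100,
--   @num_of_loops  INT = 10000,
-- and inside the loop, default sampling instead of FULLSCAN:
UPDATE STATISTICS X_SEQ_NUM X_STATS_SEQ_INT_FULL;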

(Graph 2: number of rows in the table vs. number of histogram steps, default sample rate.)

The X-axis is the number of rows in the table. As the number of rows increases the rows sampled varies a bit and doesn't go over 650k.
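The graphs come straight from the saved results; with the hypothetical column names from the setup sketch earlier, the query is just:

SELECT rows_in_table, rows_sampled, steps
FROM X_STATS_RESULTS
ORDER BY rows_in_table;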

Now let's do some simple tests with VARCHAR data.

CREATE TABLE X_SEQ_STR (X_STR VARCHAR(5));
CREATE STATISTICS X_SEQ_STR ON X_SEQ_STR(X_STR);

Here I'm inserting 200 numbers (as strings) along with NULL.

INSERT INTO X_SEQ_STR
SELECT N FROM dbo.GetNums(200)
UNION ALL
SELECT NULL;

UPDATE STATISTICS X_SEQ_STR X_SEQ_STR ;

DBCC SHOW_STATISTICS ('X_SEQ_STR', 'X_SEQ_STR'); -- 111 steps, RANGE_ROWS is 0 or 1 for all steps

Note that NULL always gets its own histogram step when it is found in the table. SQL Server could have given me exactly 201 steps to preserve all information but it did not do that. Technically information is lost because '1111' sorts between '1' and '2' for example.
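As a quick illustration of that sort order point:

-- '11' sorts between '1' and '2' as strings, so a step boundary between '1' and '2'
-- covers string values that may not exist in the table at all.
SELECT v
FROM (VALUES ('1'), ('11'), ('2')) AS t(v)
ORDER BY v;   -- returns 1, 11, 2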

Now let's try inserting different characters instead of just integers:

truncate table X_SEQ_STR;

INSERT INTO X_SEQ_STR
SELECT CHAR(10 + N) FROM dbo.GetNums(200)
UNION ALL
SELECT NULL;

UPDATE STATISTICS X_SEQ_STR X_SEQ_STR ;

DBCC SHOW_STATISTICS ('X_SEQ_STR', 'X_SEQ_STR'); -- 95 steps, RANGE_ROWS is 0 or 1 or 2

No real difference from the last test.

Now let's try inserting characters but putting different numbers of each character in the table. For example, CHAR(11) has 1 row, CHAR(12) has 2 rows, etc.

truncate table X_SEQ_STR;

DECLARE
@loop_num INT;

BEGIN
    SET NOCOUNT ON;

    SET @loop_num = 0;
    WHILE @loop_num < 200
    BEGIN
        SET @loop_num = @loop_num + 1;

        INSERT INTO X_SEQ_STR WITH (TABLOCK)
        SELECT CHAR(10 + @loop_num) FROM dbo.GetNums(@loop_num);
    END;
END;

UPDATE STATISTICS X_SEQ_STR X_SEQ_STR ;

DBCC SHOW_STATISTICS ('X_SEQ_STR', 'X_SEQ_STR'); -- 148 steps, most with RANGE_ROWS of 0

As before I still don't get exactly 200 histogram steps. However, many of the steps have RANGE_ROWS of 0.
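If you want to count those zero-RANGE_ROWS steps rather than eyeball the DBCC output, one option (a sketch) is to dump the histogram result set into a temp table and aggregate it:

CREATE TABLE #histogram
(
    RANGE_HI_KEY        SQL_VARIANT,
    RANGE_ROWS          FLOAT,
    EQ_ROWS             FLOAT,
    DISTINCT_RANGE_ROWS FLOAT,
    AVG_RANGE_ROWS      FLOAT
);

INSERT INTO #histogram
EXEC ('DBCC SHOW_STATISTICS (''X_SEQ_STR'', ''X_SEQ_STR'') WITH HISTOGRAM');

SELECT COUNT(*) AS total_steps,
       SUM(CASE WHEN RANGE_ROWS = 0 THEN 1 ELSE 0 END) AS steps_with_zero_range_rows
FROM #histogram;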

For the final test, I'm going to insert a random string of 5 characters in each loop and gather stats each time. Here's the code for generating the random string:

char((rand()*25 + 65))+char((rand()*25 + 65))+char((rand()*25 + 65))+char((rand()*25 + 65))+char((rand()*25 + 65))
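For completeness, the loop driving this final test looks roughly like the earlier test bed; the table and statistics names below are placeholders of mine.

CREATE TABLE X_RAND_STR (X_STR VARCHAR(5));
CREATE STATISTICS X_STATS_RAND_STR ON X_RAND_STR (X_STR);
GO

DECLARE @loop_num INT = 0;

WHILE @loop_num < 50000
BEGIN
    SET @loop_num = @loop_num + 1;

    -- Insert one random 5-character string per iteration.
    INSERT INTO X_RAND_STR
    SELECT char((rand()*25 + 65))+char((rand()*25 + 65))+char((rand()*25 + 65))
           +char((rand()*25 + 65))+char((rand()*25 + 65));

    UPDATE STATISTICS X_RAND_STR X_STATS_RAND_STR WITH FULLSCAN;
    -- record rows_sampled and steps from sys.dm_db_stats_properties here,
    -- just like in the earlier test bed
END;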

Here is the graph of rows in the table vs. histogram steps:

(Graph 3: number of rows in the table vs. number of histogram steps, random strings.)

Note that the number of steps doesn't dip below 100 once it starts going up and down. I've heard from somewhere (but can't source it right now) that the SQL Server histogram building algorithm combines histogram steps as it runs out of room for them. So you can end up with drastic changes in the number of steps just by adding a little data. Here's one sample of the data that I found interesting:

ROWS_IN_TABLE   ROWS_SAMPLED    STEPS
36661           36661           133
36662           36662           143
36663           36663           143
36664           36664           141
36665           36665           138

Even when sampling with FULLSCAN, adding a single row can increase the number of steps by 10, keep it constant, then decrease it by 2, then decrease it by 3.

What can we summarize from all of this? I can't prove any of this, but these observations appear to hold true:

  • SQL Server uses a general-purpose algorithm to create the histograms. For some data distributions it may be possible to create a more complete representation of the data by hand.
  • If there is NULL data in the table and the stats query finds it then that NULL data always gets its own histogram step.
  • The minimum value found in the table gets its own histogram step with RANGE_ROWS = 0.
  • The maximum value found in the table will be the final RANGE_HI_KEY in the table.
  • As SQL Server samples more data it may need to combine existing steps to make room for the new data that it finds. If you look at enough histograms you may see common values repeat for DISTINCT_RANGE_ROWS or RANGE_ROWS. For example, 255 shows up a bunch of times for RANGE_ROWS and DISTINCT_RANGE_ROWS for the final test case here.
  • For simple data distributions you may see SQL Server combine sequential data into one histogram step that causes no loss of information. However when adding gaps to the data the histogram may not adjust in the way you would hope.

When is all of this a problem? It's a problem when a query performs poorly due to a histogram that is unable to represent the data distribution in a way that lets the query optimizer make good decisions. I think there's a tendency to assume that more histogram steps are always better, and to be concerned when SQL Server generates a histogram on millions of rows or more but doesn't use exactly 200 or 201 histogram steps. However, I have seen plenty of stats problems even when the histogram has 200 or 201 steps. We don't have any control over how many histogram steps SQL Server generates for a statistics object, so I wouldn't worry about it. However, there are some steps that you can take when you experience poorly performing queries caused by stats issues. I will give an extremely brief overview.

Gathering statistics in full can help in some cases. For very large tables the auto sample size may be less than 1% of the rows in the table. Sometimes that can lead to bad plans, depending on the data distribution in the column. Microsoft's documentation for CREATE STATISTICS and UPDATE STATISTICS says as much:

SAMPLE is useful for special cases in which the query plan, based on default sampling, is not optimal. In most situations, it is not necessary to specify SAMPLE because the query optimizer already uses sampling and determines the statistically significant sample size by default, as required to create high-quality query plans.

For most workloads, a full scan is not required, and default sampling is adequate. However, certain workloads that are sensitive to widely varying data distributions may require an increased sample size, or even a full scan.
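In terms of syntax, overriding the default sample for one statistics object looks like this (the table and statistics names here are placeholders):

-- Read every row when building the histogram:
UPDATE STATISTICS dbo.BigTable IX_BigTable_SomeColumn WITH FULLSCAN;

-- Or request a fixed sample rate:
UPDATE STATISTICS dbo.BigTable IX_BigTable_SomeColumn WITH SAMPLE 10 PERCENT;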

In some cases creating filtered statistics can help. You may have a column with skewed data and many different distinct values. If there are certain values in the data that are commonly filtered on, you can create a statistics histogram for just those common values. The query optimizer can use the statistics defined on a smaller range of data instead of the statistics defined on all column values. You still are not guaranteed to get 200 steps in the histogram, but if you create the filtered stats on just one value you will get a histogram step for that value.
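A filtered statistics object is just CREATE STATISTICS with a WHERE clause; for example (hypothetical table and column names):

-- Histogram built only over the commonly-filtered value.
CREATE STATISTICS ST_Orders_OpenStatus
ON dbo.Orders (OrderStatus)
WHERE OrderStatus = 'OPEN'
WITH FULLSCAN;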

Using a partitioned view is one way to effectively get more than 200 steps for a table. Suppose that you can easily split up a large table into one table per year. You create a UNION ALL view that combines all of the yearly tables. Each table will have its own histogram. Note that the incremental statistics feature introduced in SQL Server 2014 only makes stats updates more efficient; the query optimizer will not use the statistics that are created per partition.
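A minimal sketch of that idea, with hypothetical yearly tables:

-- Each base table has its own statistics object, so the view effectively exposes
-- more than 200 histogram steps' worth of information to the optimizer.
CREATE VIEW dbo.Sales_All
AS
SELECT * FROM dbo.Sales_2015
UNION ALL
SELECT * FROM dbo.Sales_2016
UNION ALL
SELECT * FROM dbo.Sales_2017;

For the optimizer to skip irrelevant yearly tables you'd also want CHECK constraints on the partitioning column in each base table; the sketch omits that detail.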

There are many more tests that could be run here, so I encourage you to experiment. I did this testing on SQL Server 2014 Express, so there's really nothing stopping you.