Sql-server – Can the SQL Server optimizer use indexes to update auto created statistics

sql serversql-server-2017

Consider the scenario where auto created statistics are created on a column to satisfy a query.

Afterwards; a nonclustered index is created on the column, with or without other key columns.

Could the SQL Server optimizer leverage indexes to update these auto created statistics when running UPDATE STATISTICS statements?

Is there a difference between (default) sample rates & fullscan statistic updates when adding these indexes?

SELECT @@VERSION;

Microsoft SQL Server 2017 (RTM-CU12) (KB4464082) - 14.0.3045.24 (X64)   Oct 18 2018 23:11:05   Copyright (C) 2017 Microsoft Corporation  Developer Edition (64-bit) on Windows Server 2016 Datacenter 10.0 <X64> (Build 14393: ) (Hypervisor) 

Best Answer


TL;DR

Auto created statistics will be improved by an index if you are using fullscan updates & have the column as the first key column in an index.

(Default) sample rate statistic updates will use the clustered index / heap table scan even when a NCI exists with the column as the first key column.

Even when updating the statistic of the NCI.

As such, updating statistics with a sample rate will not be positively impacted by adding an index.


Setup

Creating a table with a primary key & 10M rows:

CREATE TABLE dbo.testing(ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL, val nchar(500));

INSERT INTO dbo.testing  WITH(TABLOCK) (val)
SELECT top(10000000) REPLICATE('b',((ROW_NUMBER()OVER (ORDER BY (SELECT NULL))) % 400))
FROM master..spt_values spt1
CROSS APPLY master..spt_values spt2
CROSS APPLY master..spt_values spt3;

A nchar(500) value is used to show a clear difference between statistic updates.

We can create an auto created statistic on the val column with this query:

SELECT val
FROM dbo.testing 
where val = 'bla'
order by val;

Full scan stat updates

Updating the statistic with fullscan:

SET STATISTICS IO, TIME ON;
UPDATE STATISTICS dbo.testing([_WA_Sys_00000002_4F5F3681]) WITH FULLSCAN;

Results in 127 seconds of cpu tme on my machine:

CPU time = 127141 ms,  elapsed time = 80147 ms.

Now when we look at the query that the stat updates actually runs:

SELECT StatMan([SC0]) FROM (SELECT TOP 100 PERCENT [val] AS [SC0] FROM [dbo].[testing] WITH (READUNCOMMITTED)  ORDER BY [SC0] ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 2);

We can see that it went parallel on the two cores on my machine & did order by the val column.

As a side note, there was also a memory grant & tempdb was used to support the ordering, The memory grant was as high as 12484 MB or ~ 12 GB

Let's see what happens when we create this index:

CREATE INDEX IX_val
on dbo.testing(val);

And re-run the stat update:

SET STATISTICS IO, TIME ON;
UPDATE STATISTICS dbo.testing([_WA_Sys_00000002_4F5F3681]) WITH FULLSCAN;

This results in an execution time of

 CPU time = 10578 ms,  elapsed time = 10882 ms.

And the memory grant & tempdb usage is gone since it does not have to sort anymore.

The statistic update with fullscan on the auto created stat is able to use the index.

This would also work for an index like this:

CREATE INDEX IX_val_val2
on dbo.testing(val,val2);

But not when creating an index like this:

CREATE INDEX IX_val2_val
on dbo.testing(val2,val);

Sample rates

If we change the update stats to sample 10 percent instead of doing a fullscan:

UPDATE STATISTICS dbo.testing([_WA_Sys_00000002_4F5F3681]) WITH SAMPLE 10 PERCENT

Without an index this amounts to

CPU time = 10500 ms,  elapsed time = 11100 ms.

With an index this amounts to

CPU time = 10312 ms,  elapsed time = 11038 ms.

Auto created statistics with sample rates are not positively impacted by indexes on the column where the column is the first key column. Why this happens is explained further below


Comparison with updating the index' stats

As expected, updating the statistic on the index takes the same amount of time as the auto created statistics.

UPDATE STATISTICS dbo.testing(IX_val) WITH SAMPLE 10 PERCENT

CPU time = 11203 ms,  elapsed time = 11849 ms.

UPDATE STATISTICS dbo.testing(IX_val) WITH FULLSCAN;

CPU time = 10938 ms,  elapsed time = 11005 ms.

Sampled statistic update on an indexed column stat

The query used when updating statistics with a sample rate of 10 percent is different:

SELECT StatMan([SC0], [SC1], [SB0000]) FROM (SELECT TOP 100 PERCENT [SC0], [SC1], step_direction([SC0]) over (order by NULL) AS [SB0000]  
FROM (SELECT [val] AS [SC0], [ID] AS [SC1] 
FROM [dbo].[testing] TABLESAMPLE SYSTEM (9.999986e+000 PERCENT) WITH (READUNCOMMITTED) ) 
AS _MS_UPDSTATS_TBL_HELPER ORDER BY [SC0], [SC1], [SB0000] ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 1)

TABLESAMPLE will give us random pages and it's explanation can be found here.

The statistic orders by val,id,step_direction([SC0]) over (order by NULL)

We can trace the query plan of this StatMan query with the profiler & the Showplan XML Statistics Profile event:

UPDATE STATISTICS dbo.testing(IX_val) WITH SAMPLE 10 PERCENT

The execution plan shows the clustered index being used, instead of the expected nonclustered index:

enter image description here

I cannot share the query plan with paste the plan unfortunately, it is malformed.

The same is true for heap tables. The usage of the NCI would not make a difference here, the TABLESAMPLE SYSTEM (9.999986e+000 PERCENT) retrieves 10 percent of the data pages in a random order. This is why the CI was used.