Sql-server – Can the SQL Server optimizer use indexes to update auto created statistics

sql serversql-server-2017

Consider the scenario where auto created statistics are created on a column to satisfy a query.

Afterwards; a nonclustered index is created on the column, with or without other key columns.

Could the SQL Server optimizer leverage indexes to update these auto created statistics when running UPDATE STATISTICS statements?

Is there a difference between (default) sample rates & fullscan statistic updates when adding these indexes?

SELECT @@VERSION;

Microsoft SQL Server 2017 (RTM-CU12) (KB4464082) - 14.0.3045.24 (X64)   Oct 18 2018 23:11:05   Copyright (C) 2017 Microsoft Corporation  Developer Edition (64-bit) on Windows Server 2016 Datacenter 10.0 <X64> (Build 14393: ) (Hypervisor)

Best Answer

TL;DR

Auto created statistics will be improved by an index if you are using fullscan updates & have the column as the first key column in an index.

(Default) sample rate statistic updates will use the clustered index / heap table scan even when a NCI exists with the column as the first key column.

Even when updating the statistic of the NCI.

As such, updating statistics with a sample rate will not be positively impacted by adding an index.

Setup

Creating a table with a primary key & 10M rows:

CREATE TABLE dbo.testing(ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL, val nchar(500));

INSERT INTO dbo.testing  WITH(TABLOCK) (val)
SELECT top(10000000) REPLICATE('b',((ROW_NUMBER()OVER (ORDER BY (SELECT NULL))) % 400))
FROM master..spt_values spt1
CROSS APPLY master..spt_values spt2
CROSS APPLY master..spt_values spt3;

A nchar(500) value is used to show a clear difference between statistic updates.

We can create an auto created statistic on the val column with this query:

SELECT val
FROM dbo.testing 
where val = 'bla'
order by val;

Full scan stat updates

Updating the statistic with fullscan:

SET STATISTICS IO, TIME ON;
UPDATE STATISTICS dbo.testing([_WA_Sys_00000002_4F5F3681]) WITH FULLSCAN;

Results in 127 seconds of cpu tme on my machine:

CPU time = 127141 ms,  elapsed time = 80147 ms.

Now when we look at the query that the stat updates actually runs:

SELECT StatMan([SC0]) FROM (SELECT TOP 100 PERCENT [val] AS [SC0] FROM [dbo].[testing] WITH (READUNCOMMITTED)  ORDER BY [SC0] ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 2);

We can see that it went parallel on the two cores on my machine & did order by the val column.

As a side note, there was also a memory grant & tempdb was used to support the ordering, The memory grant was as high as 12484 MB or ~ 12 GB

Let's see what happens when we create this index:

CREATE INDEX IX_val
on dbo.testing(val);

And re-run the stat update:

SET STATISTICS IO, TIME ON;
UPDATE STATISTICS dbo.testing([_WA_Sys_00000002_4F5F3681]) WITH FULLSCAN;

This results in an execution time of

 CPU time = 10578 ms,  elapsed time = 10882 ms.

And the memory grant & tempdb usage is gone since it does not have to sort anymore.

The statistic update with fullscan on the auto created stat is able to use the index.

This would also work for an index like this:

CREATE INDEX IX_val_val2
on dbo.testing(val,val2);

But not when creating an index like this:

CREATE INDEX IX_val2_val
on dbo.testing(val2,val);

Sample rates

If we change the update stats to sample 10 percent instead of doing a fullscan:

UPDATE STATISTICS dbo.testing([_WA_Sys_00000002_4F5F3681]) WITH SAMPLE 10 PERCENT

Without an index this amounts to

CPU time = 10500 ms,  elapsed time = 11100 ms.

With an index this amounts to

CPU time = 10312 ms,  elapsed time = 11038 ms.

Auto created statistics with sample rates are not positively impacted by indexes on the column where the column is the first key column. Why this happens is explained further below

Comparison with updating the index' stats

As expected, updating the statistic on the index takes the same amount of time as the auto created statistics.

UPDATE STATISTICS dbo.testing(IX_val) WITH SAMPLE 10 PERCENT

CPU time = 11203 ms,  elapsed time = 11849 ms.

UPDATE STATISTICS dbo.testing(IX_val) WITH FULLSCAN;

CPU time = 10938 ms,  elapsed time = 11005 ms.

Sampled statistic update on an indexed column stat

The query used when updating statistics with a sample rate of 10 percent is different:

SELECT StatMan([SC0], [SC1], [SB0000]) FROM (SELECT TOP 100 PERCENT [SC0], [SC1], step_direction([SC0]) over (order by NULL) AS [SB0000]  
FROM (SELECT [val] AS [SC0], [ID] AS [SC1] 
FROM [dbo].[testing] TABLESAMPLE SYSTEM (9.999986e+000 PERCENT) WITH (READUNCOMMITTED) ) 
AS _MS_UPDSTATS_TBL_HELPER ORDER BY [SC0], [SC1], [SB0000] ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 1)

TABLESAMPLE will give us random pages and it's explanation can be found here.

The statistic orders by val,id,step_direction([SC0]) over (order by NULL)

We can trace the query plan of this StatMan query with the profiler & the Showplan XML Statistics Profile event:

UPDATE STATISTICS dbo.testing(IX_val) WITH SAMPLE 10 PERCENT

The execution plan shows the clustered index being used, instead of the expected nonclustered index:

I cannot share the query plan with paste the plan unfortunately, it is malformed.

The same is true for heap tables. The usage of the NCI would not make a difference here, the TABLESAMPLE SYSTEM (9.999986e+000 PERCENT) retrieves 10 percent of the data pages in a random order. This is why the CI was used.

Related Solutions

Sql-server – Workaround for very long running query on the schema

It might help to update statistics on the system tables that are used by the query. Based on your STATISTICS IO output you would run the following code:

UPDATE STATISTICS sys.syssingleobjrefs WITH FULLSCAN;
UPDATE STATISTICS sys.sysidxstats WITH FULLSCAN;
UPDATE STATISTICS sys.sysschobjs WITH FULLSCAN;
UPDATE STATISTICS sys.sysobjvalues WITH FULLSCAN;
UPDATE STATISTICS sys.syscolpars WITH FULLSCAN;
UPDATE STATISTICS sys.sysclsobjs WITH FULLSCAN;
UPDATE STATISTICS sys.sysiscols WITH FULLSCAN;
UPDATE STATISTICS sys.sysscalartypes WITH FULLSCAN;

"Statistics on system tables and query performance" contains a reference to this technique.

Depending on your database updating statistics may not be sufficient. I created a test database with 10000 simple tables with one column each:

DECLARE 
@j INT = 0,
@create_table VARCHAR(1000);
BEGIN
    WHILE @j < 10000
    BEGIN
        SET @create_table = 'CREATE TABLE X_TABLE_' + CAST(@j AS VARCHAR) + ' (ID INT NULL)';

        EXEC (@create_table);

        SET @j = @j + 1;
    END;
END;

The query that you have should return 10k rows but it still hadn't finished after one minute. It had only returned 651 rows so it was on track to finish in 15 minutes.

One change that made a big difference on my database was enabling the legacy cardinality estimator. The query returned all 10000 rows in one second after adding the following hint:

OPTION (RECOMPILE, USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION') );

As you said you cannot modify the code so that isn't immediately helpful. However, you can enable trace flag 9481 at the session level to get the legacy CE. It's possible to enable trace flag 9481 for a particular user as described in "Wanting your non-sysadmin users to enable certain trace flags without changing your app?". However, I do not know what other effects this would have on your application.

If it turns out that the legacy CE also improves performance in your database a less disruptive option may be to create a plan guide for this query. A plan guide will force a query to have a particular execution plan. You could capture a good enough execution play using the legacy CE and force future query executions to use that same plan even without TF 9481 turned on. You should be aware that many people in the industry do not speak favorably of plan guides.

Sql-server – Unexplainable DBCC CHECKDB Command Output

The error message you're seeing:

Sqlcmd: Warning: The last operation was terminated because the user pressed CTRL+C.

Occurs when sqlcmd is asked to quit while it is waiting for a response from SQL Server. Most likely someone (or something) killed the spawned cmd.exe process that was running SQLCMD.exe, which resulting in that error message being directed to the SQL Server errorlog.

Instead of using SQLCMD to execute DBCC CHECKDB, you should simply setup a standard T-SQL job step to run the DBCC CHECKDB command directly.

Alternately, run Ola Hallengren's excellent free and open-source Maintenance Solution, or one of the other myriad solutions available.

Best Answer

TL;DR

Related Solutions

Sql-server – Workaround for very long running query on the schema

Sql-server – Unexplainable DBCC CHECKDB Command Output

Related Question