SQL Server – Does the number of rows affect UPDATE STATISTICS performance when sampling n rows?

sql-server, sql-server-2014, statistics

For example, I have table A with 80 million rows and table B with 200 million rows.

Assume both tables have 30+ statistics objects (all automatically generated by SQL Server).
I take the first 10 statistics from each table and run the two tasks separately:

UPDATE STATISTICS A (stat1, stat2, ...) WITH SAMPLE 5000000 ROWS;
UPDATE STATISTICS B (stat1, stat2, ...) WITH SAMPLE 5000000 ROWS;

Even though both sample only 5,000,000 rows, the second task takes much longer to complete. Is this difference in performance solely due to the contents of the tables, or does the total number of rows in the table also play a part even when the number of rows to sample is fixed?

Best Answer

UPDATE STATISTICS is a single-threaded task (though you can run multiple instances of it in parallel), so the duration of a statistics update correlates with the number of rows the task has to scan in order to select the sample rows and build the statistics.
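Since each UPDATE STATISTICS statement runs single-threaded, one way to reduce wall-clock time is to split the work into per-statistic statements and issue them from separate connections. A sketch, reusing the placeholder statistic names from the question:

```sql
-- Session 1: update one statistic
UPDATE STATISTICS dbo.A (stat1) WITH SAMPLE 5000000 ROWS;

-- Session 2 (a separate connection, run concurrently): update another
UPDATE STATISTICS dbo.A (stat2) WITH SAMPLE 5000000 ROWS;
```

Each statement is still single-threaded internally, but the two sessions overlap, so total elapsed time can drop at the cost of extra concurrent I/O.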

Even if you are only sampling a fixed number of rows, the population you are sampling from is larger. SQL Server spreads the sample across the table (it samples at the page level rather than reading one contiguous block), so picking the same number of sample rows from a 200-million-row table touches more of the table than picking them from an 80-million-row table, making the scan more time-consuming.
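You can see the effect of table size on the sample by comparing rows versus rows_sampled for each statistic after the update, via the `sys.dm_db_stats_properties` DMF (available in SQL Server 2014). A sketch, assuming the tables live in `dbo` and are named `A` and `B` as in the question:

```sql
-- For each statistic on tables A and B, show the table's row count
-- at the time of the last update and how many rows were actually read.
SELECT  OBJECT_NAME(st.object_id) AS table_name,
        st.name                   AS stats_name,
        sp.last_updated,
        sp.rows,                  -- total rows in the table at update time
        sp.rows_sampled           -- rows read to build the statistic
FROM    sys.stats AS st
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) AS sp
WHERE   st.object_id IN (OBJECT_ID(N'dbo.A'), OBJECT_ID(N'dbo.B'));
```

With SAMPLE 5000000 ROWS, rows_sampled will be similar for both tables, while rows shows the much larger population the sample for table B had to be drawn from.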

There are some interesting details about the effects and use of tempdb during statistics scans in Erin Stellato's post about statistics, but almost everything you want to know is covered in Robert Sheldon's article.