I have a 200 GB data warehouse in SQL Server.
I have been experiencing really slow execution times for some queries; for example, 12 hours for a simple delete query with an inner join.
After doing some research with the execution plans, I've updated the statistics of the 2 tables involved in the query, using the WITH FULLSCAN option.
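Roughly along these lines (the table names here are just placeholders for the two tables involved):

    UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;
    UPDATE STATISTICS dbo.DimCustomer WITH FULLSCAN;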
The query now executes in less than a second, so it appears the statistics were not up to date.
I'm considering disabling auto update statistics on the database and running UPDATE STATISTICS manually after the data warehouse is loaded. The data warehouse is loaded incrementally from a source ERP system daily, at night.
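In other words, something like this (object names are again placeholders): turn the option off once, then refresh statistics at the end of each nightly ETL run:

    -- One-time change: stop SQL Server from updating statistics on its own
    ALTER DATABASE MyDataWarehouse SET AUTO_UPDATE_STATISTICS OFF;

    -- Last step of each nightly ETL run
    UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;
    UPDATE STATISTICS dbo.DimCustomer WITH FULLSCAN;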
Am I correct in assuming that auto update statistics is not really useful in a data warehousing scenario? Instead, does it make more sense to update the statistics manually after the data is loaded?
Best Answer
Here is a whitepaper on when auto_update of stats occurs. The salient point vis-a-vis automatic updates to statistics is the modification threshold: for a table with more than 500 rows, the automatic update does not fire until roughly 500 + 20% of the table's rows have been modified in the statistic's leading column.

So @JNK made the point in a comment that if you have 1 billion rows in a table, you'd need to have 200,000,500 writes to the first column in the statistic to trigger an update.
Let's take the following structure:
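Something along these lines works as a stand-in (table, column, and index names are purely illustrative):

    -- Small test table; the nonclustered index creates a statistics object on col1
    CREATE TABLE dbo.StatsTest
    (
        id   INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
        col1 INT NOT NULL
    );

    CREATE NONCLUSTERED INDEX IX_StatsTest_col1 ON dbo.StatsTest (col1);

    -- Load enough rows for the statistic to have something to describe
    INSERT INTO dbo.StatsTest (col1)
    SELECT TOP (1000) ABS(CHECKSUM(NEWID())) % 100
    FROM sys.all_objects;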
Now we can check to see what happened in statistics land.
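One way (assuming the illustrative dbo.StatsTest objects above) is to list the statistics on the table and when each one was last updated:

    SELECT  s.name,
            s.auto_created,
            STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM    sys.stats AS s
    WHERE   s.object_id = OBJECT_ID('dbo.StatsTest');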
However, to see if this is a meaningful statistic object we need to:
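Look at the statistics header itself, along these lines:

    -- Header shows Rows, Rows Sampled, and the Updated date for the object
    DBCC SHOW_STATISTICS ('dbo.StatsTest', IX_StatsTest_col1) WITH STAT_HEADER;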
So this statistic hasn't been updated. That's because it looks like the statistic isn't updated until a SELECT occurs, and even then the SELECT has to fall outside of what SQL Server has within its histogram. Here's the kind of test script I ran to check this:
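A rough sketch of that test, assuming the illustrative dbo.StatsTest table from above (about 1,000 rows, so the auto-update threshold is roughly 500 + 20% = 700 modifications):

    -- 1. Modify far more rows than the threshold
    UPDATE dbo.StatsTest SET col1 = col1 + 1000;

    -- 2. The statistic still shows its old update date
    SELECT name, STATS_DATE(object_id, stats_id) AS last_updated
    FROM   sys.stats
    WHERE  object_id = OBJECT_ID('dbo.StatsTest');

    -- 3. Run a query that needs the statistic, with a predicate outside
    --    the range the existing histogram covers
    SELECT COUNT(*) FROM dbo.StatsTest WHERE col1 = 1099;

    -- 4. Check again: the automatic update should now have fired
    SELECT name, STATS_DATE(object_id, stats_id) AS last_updated
    FROM   sys.stats
    WHERE  object_id = OBJECT_ID('dbo.StatsTest');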
Instead of blindly disabling auto_update statistics, I would try to examine your data set for skew. If your data exhibits significant skew, then you need to consider creating filtered statistics and then decide whether managing statistics updates manually is the correct course of action.
To analyze for skew you need to run DBCC SHOW_STATISTICS(<table>, <statistic_or_index_name>); (the same call as above, just without the WITH STAT_HEADER clause) for the particular stat/index combination that you'd like to examine. A quick way to eyeball your skew is to look at the histogram (third result set) and check the variance in your EQ_ROWS values. If they're fairly consistent, then your skew is minimal. To step it up, look at the RANGE_ROWS column and the variance there, since this measures how many rows exist between each step. Finally, you can take the [All density] result from the DENSITY_VECTOR (second result set), multiply it by the [Rows Sampled] value in the STAT_HEADER (first result set), and see what the average expectation for a query on that column would be. Compare that average to your EQ_ROWS values, and if there are many places where it varies significantly, then you've got skew.
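Against the illustrative objects above, that check would look something like this (the density figures in the comment are made up, just to show the arithmetic):

    -- Full output: header, density vector, and histogram
    DBCC SHOW_STATISTICS ('dbo.StatsTest', IX_StatsTest_col1);

    -- Average expectation = [All density] * [Rows Sampled].
    -- e.g. an All density of 0.01 over 1,000 sampled rows gives an average
    -- of about 10 rows per value; EQ_ROWS values far from that point to skew.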
If you find that you do have skew, then you need to consider creating some filtered statistics on the ranges that have very high RANGE_ROWS, so that you can add additional histogram steps for better estimates on those values.

Once you have these filtered statistics in place, you can then look at the possibility of manually updating statistics.
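A sketch of what that could look like against the illustrative table (the filter range is made up):

    -- Extra statistics object covering only a heavily populated range,
    -- giving the optimizer more histogram steps where the skew is
    CREATE STATISTICS st_StatsTest_col1_hot
    ON dbo.StatsTest (col1)
    WHERE col1 BETWEEN 1000 AND 1050
    WITH FULLSCAN;

    -- If you do take over maintenance yourself, refresh it after each load
    UPDATE STATISTICS dbo.StatsTest st_StatsTest_col1_hot WITH FULLSCAN;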