Defragmentation strategies help improve scan speed to/from disk.
The wide variety of opinions exists because an environment's ideal defragmentation strategy depends on many different factors. There are also multiple potential layers of fragmentation in play.
Saying that your databases are stored on a SAN isn't enough information. For example:
- Are database files stored on separate physical RAID groups or the same RAID group? What other processes are active on that same device? Are your backup files ending up there, too? You may have to ask your SAN admin for this information, because it's not always transparent.
- What are the access patterns for the databases? OLTP is generally random access, but sometimes an application is table-scan-happy and you can't change its behaviour (e.g., an ISV app). Are the applications read-mostly, write-mostly, or somewhere in between?
- Are there performance SLAs in play during a recovery/failover period?
Brent's post assumes there is one giant pool of storage and everything shares it. This means the physical disks are rarely idle, and hence most access is random. If that is your situation, then the advice applies, and I agree with it for the most part. While this type of strategy is much easier to manage, it isn't necessarily (a) what you have in your environment, or (b) what is the best solution for your environment.
If index maintenance is burdensome, consider doing it less aggressively, and/or amortize the cost over the week (i.e., run light maintenance once/day, instead of heavy maintenance once/week).
You can also turn on the SORT_IN_TEMPDB option to potentially reduce the amount of logging that takes place in the user databases.
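As a sketch of what "lighter, more frequent" maintenance might look like (the table and index names are placeholders for illustration), you can check fragmentation first and prefer the cheap operation, saving the rebuild for when it is really needed:

```sql
-- Check fragmentation before deciding what to do
SELECT i.name,
       s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.tab1'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id;

-- Light, run often: reorganize is online and incremental
ALTER INDEX ix_tab1_col1 ON dbo.tab1 REORGANIZE;

-- Heavy, run rarely: rebuild, sorting in tempdb to move the sort work
-- and some of the logging pressure out of the user database
ALTER INDEX ix_tab1_col1 ON dbo.tab1 REBUILD WITH (SORT_IN_TEMPDB = ON);
```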
A perfect query optimizer would produce the best plan if provided with fully sampled statistics. But even a perfect optimizer does not need fully sampled statistics to produce the best plan. A simple example:
select count(*)
from tab1
where col1 = 3
  and col2 = 4
Assume that for this table tab1 there exists a nonclustered index on col1. The optimizer has two possible plans:
1) scan the whole table and check, for each row, whether col1=3 and col2=4
2) consult the index to find the positions of all rows in the table with col1=3, then look up each of these rows in the table and check whether col2=4
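To compare the two alternatives by hand, you can force each plan shape with table hints (the index name ix_tab1_col1 is assumed for illustration):

```sql
-- Plan 1: full table scan
SELECT COUNT(*)
FROM tab1 WITH (FORCESCAN)
WHERE col1 = 3 AND col2 = 4;

-- Plan 2: seek on the nonclustered index, then a lookup per matching row
SELECT COUNT(*)
FROM tab1 WITH (FORCESEEK, INDEX(ix_tab1_col1))
WHERE col1 = 3 AND col2 = 4;
```

Comparing the actual execution statistics of the two forced plans shows which choice the optimizer should make.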
Now assume the table has a size of 123.456 GB and 1,123,456,789 rows, and 567,890,123 of them (about half) have col1=3. What should the optimizer propose? It should propose a full table scan. The other possibility, looking up 567,890,123 rows one by one, is too expensive.
Now assume the optimizer only has a rough estimate of the table data: the table size is about 100 GB, there are about 1,000,000,000 rows, and about 600,000,000 of them have col1=3. Based on this information, the optimizer will choose the same good plan as in the first case.
The decision is based on the ratio 100 * (# of rows with col1=3) / (# of all rows): if this value is larger than some threshold, say 10% (I don't know which value is really used), the full table is scanned. The decision does not depend on the fifth decimal place of the ratio.
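To make the arithmetic concrete, using the figures from the example above (and the illustrative 10% threshold, which is only a guess at the real value), both the exact and the estimated selectivity land far on the same side of the cutoff:

```sql
-- Exact statistics:  100 * 567,890,123 / 1,123,456,789  (about 50.5%)
-- Rough estimate:    100 * 600,000,000 / 1,000,000,000  (60.0%)
SELECT 100.0 * 567890123 / 1123456789 AS exact_pct,
       100.0 * 600000000 / 1000000000 AS estimated_pct;
-- Both are well above a ~10% threshold, so both lead to the same
-- full-scan plan; the estimate's imprecision changes nothing.
```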
In a special case where the exact ratio is 10.00001% and the ratio based on the rough estimate is 9.99999%, the optimizer will choose different plans. But even then, the execution time may be the same for both plans, or may even be better for the plan based on the rough estimate.
Administrative jobs like calculating statistics should consume as few resources as possible. Therefore, only the precision that is actually necessary should be aimed for.
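In T-SQL terms, keeping statistics maintenance cheap means sampling instead of scanning (the table name and sample size here are illustrative):

```sql
-- Cheap: read roughly 10 percent of the pages to build the histogram
UPDATE STATISTICS dbo.tab1 WITH SAMPLE 10 PERCENT;

-- Expensive: read every row; only justified when sampled statistics
-- demonstrably lead to bad plans
UPDATE STATISTICS dbo.tab1 WITH FULLSCAN;
```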
Best Answer
For
WITH SAMPLE 50 PERCENT
it works as though, for each data page in the table, SQL Server flips a coin. If it lands heads, it reads all the rows on the page; if it lands tails, it reads none.

Tracing the
UPDATE STATISTICS T WITH SAMPLE 50 PERCENT
call in Profiler shows the query that is emitted. In that query, the
TABLESAMPLE SYSTEM (5.000000e+001 PERCENT)
clause is responsible for the sampling and is documented here. The documentation also states:
The
STATMAN
call is to an internal aggregate function briefly described here.
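One way to observe the page-level behaviour directly is to run TABLESAMPLE SYSTEM yourself (the table name T follows the example above). Because whole pages are included or excluded, the returned count fluctuates around half the table from run to run:

```sql
-- TABLESAMPLE SYSTEM includes or excludes whole pages, not rows,
-- so repeated executions return noticeably different counts.
SELECT COUNT(*) AS sampled_rows
FROM dbo.T TABLESAMPLE SYSTEM (50 PERCENT);
```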