Background
Data for the statistics object are gathered using a statement of the form:
SELECT
StatMan([SC0], [SC1], [SB0000])
FROM
(
SELECT TOP 100 PERCENT
[SC0], [SC1], STEP_DIRECTION([SC0]) OVER (ORDER BY NULL) AS [SB0000]
FROM
(
SELECT
[TextValue] AS [SC0],
[Id] AS [SC1]
FROM [dbo].[Test]
TABLESAMPLE SYSTEM (2.223684e+001 PERCENT)
WITH (READUNCOMMITTED)
) AS _MS_UPDSTATS_TBL_HELPER
ORDER BY
[SC0],
[SC1],
[SB0000]
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1)
You can collect this statement with Extended Events or Profiler (SP:StmtCompleted).
Statistics generation queries often access the base table (rather than a nonclustered index) to avoid the clustering of values that naturally occurs on nonclustered index pages.
The number of rows sampled depends on the number of whole pages selected for sampling. Each page of the table is either selected or it is not. All rows on selected pages contribute to the statistics.
Random numbers
SQL Server uses a random number generator to decide whether a page qualifies or not. The generator used in this instance is the Lehmer random number generator with the parameter values shown below:
Xnext = Xseed * 7^5 mod (2^31 - 1)
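As a minimal sketch (Python used for illustration; the constants come from the equation above), the generator can be written as:

```python
# Minimal sketch of the MINSTD Lehmer generator described above.
# Multiplier 16807 = 7**5 (0x41A7); modulus 2**31 - 1 (a Mersenne prime).
MULTIPLIER = 16807
MODULUS = 2**31 - 1

def lehmer_next(x_seed: int) -> int:
    """Return the next value: Xnext = Xseed * 7^5 mod (2^31 - 1)."""
    return (x_seed * MULTIPLIER) % MODULUS
```

Starting from seed 1, the first value is 16807. Note that any seed congruent to zero modulo 2^31 - 1 produces zero forever, which matters for the edge case discussed later.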
The value of Xseed is computed as the sum of:
- The low 32 bits of the base table's (bigint) partition_id, e.g.:
SELECT
P.[partition_id] & 0xFFFFFFFF
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1;
- The value specified in the REPEATABLE clause. For sampled UPDATE STATISTICS, the REPEATABLE value is 1. This value is exposed in the m_randomSeed element of the access method's internal debugging information, shown in execution plans when trace flag 8666 is enabled, for example <Field FieldName="m_randomSeed" FieldValue="1" />.
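To make the seed computation concrete, here is a small Python sketch; the partition_id value is hypothetical, standing in for whatever the catalog query above returns:

```python
# Hypothetical bigint partition_id (a real value comes from sys.partitions).
partition_id = 72057594040614912
repeatable = 1  # REPEATABLE value: always 1 for sampled UPDATE STATISTICS

# Xseed = low 32 bits of partition_id + the REPEATABLE value
x_seed = (partition_id & 0xFFFFFFFF) + repeatable
```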
For SQL Server 2012, this calculation occurs in sqlmin!UnOrderPageScanner::StartScan:
mov edx,dword ptr [rcx+30h]
add edx,dword ptr [rcx+2Ch]
where memory at [rcx+30h] contains the low 32 bits of the partition id, and memory at [rcx+2Ch] contains the REPEATABLE value in use.
The random number generator is initialized later in the same method by a call to sqlmin!RandomNumGenerator::Init, where the instruction:
imul r9d,r9d,41A7h
...multiplies the seed by 41A7 hex (16807 decimal = 7^5) as shown in the equation above.
Later random numbers (for individual pages) are generated using the same basic code, inlined into sqlmin!UnOrderPageScanner::SetupSubScanner.
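The exact per-page test is not shown in the disassembly above, so the following Python sketch is an assumption about its shape: each page's next random draw, scaled to a percentage, is compared against the sample rate, so every page is independently in or out.

```python
# Assumed shape of the page-qualification test (not taken from the binary):
# one MINSTD draw per page, scaled to [0, 100), compared to the sample rate.
MULTIPLIER = 16807       # 7**5, as in the imul instruction above
MODULUS = 2**31 - 1

def sample_pages(x_seed: int, page_count: int, sample_percent: float) -> list[int]:
    """Return the page ids whose draw falls at or under the sample percentage."""
    selected = []
    for page_id in range(page_count):
        x_seed = (x_seed * MULTIPLIER) % MODULUS
        if x_seed * 100.0 / MODULUS <= sample_percent:
            selected.append(page_id)
    return selected
```

With this shape, a seed of 2^31 - 1 makes every draw zero, so every page qualifies regardless of the percentage, which is consistent with the edge case described below.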
StatMan
For the example StatMan query shown above, the same pages will be collected as for the T-SQL statement:
SELECT
COUNT_BIG(*)
FROM dbo.Test AS T
TABLESAMPLE SYSTEM (2.223684e+001 PERCENT) -- Same sample %
REPEATABLE (1) -- Always 1 for statman
WITH (INDEX(0)); -- Scan base object
This will match the output of:
SELECT
DDSP.rows_sampled
FROM sys.stats AS S
CROSS APPLY sys.dm_db_stats_properties(S.[object_id], S.stats_id) AS DDSP
WHERE
S.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND S.[name] = N'IX_Test_TextValue';
Edge case
One consequence of using the MINSTD Lehmer random number generator is that seed values of zero and int.max must be avoided, since either one causes the algorithm to produce a sequence of zeroes (selecting every page).
The code detects a zero seed and uses a value from the system 'clock' as the seed in that case. It does not do the same when the seed is int.max (0x7FFFFFFF = 2^31 - 1).
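The degenerate sequence is easy to demonstrate in Python, using the same constants as before:

```python
MULTIPLIER = 16807
MODULUS = 2**31 - 1  # 0x7FFFFFFF, i.e. int.max

def lehmer_run(x_seed: int, n: int) -> list[int]:
    """Generate n successive MINSTD values from a starting seed."""
    out = []
    for _ in range(n):
        x_seed = (x_seed * MULTIPLIER) % MODULUS
        out.append(x_seed)
    return out

# Seeds 0 and int.max are both congruent to 0 mod (2^31 - 1),
# so the generator collapses to an endless run of zeroes.
```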
We can engineer this scenario, since the initial seed is calculated as the sum of the low 32 bits of the partition id and the REPEATABLE value. The REPEATABLE value that will result in the seed being int.max, and therefore every page being selected for the sample, is:
SELECT
0x7FFFFFFF - (P.[partition_id] & 0xFFFFFFFF)
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1;
Working that into a complete example:
DECLARE @SQL nvarchar(4000) =
N'
SELECT
COUNT_BIG(*)
FROM dbo.Test AS T
TABLESAMPLE (0 PERCENT)
REPEATABLE (' +
(
SELECT TOP (1)
CONVERT(nvarchar(11), 0x7FFFFFFF - (P.[partition_id] & 0xFFFFFFFF))
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1
) + ')
WITH (INDEX(0));';
PRINT @SQL;
--EXECUTE (@SQL);
That will select every row on every page, whatever the TABLESAMPLE clause says (even zero percent).
Best Answer
Let's break down the command you're running:
Even once a week, this is absurdly aggressive. I know the 5%/30% thresholds come from ancient Microsoft advice, but it's just that: ancient. It's time to move on.
You're taking the time to defragment indexes that are 500 8KB pages. That's a 4MB table. If your hardware has issues reading 4MB into memory, or keeping a 4MB table in memory, the answer isn't index maintenance. That's why I crank this up to at least 5000.
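The arithmetic behind that size (Python, just for the back-of-the-envelope numbers):

```python
# 500 data pages at 8 KB each
pages = 500
page_bytes = 8 * 1024

table_bytes = pages * page_bytes
table_mb = table_bytes / (1024 * 1024)  # about 3.9 MB -- the "4MB table" above
```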
Modern hardware (SANs, SSDs, and 64-bit servers that can hold far more than 4GB of RAM) simply doesn't have the same data access issues that spinning-platter drives from the early 2000s had.
We're talking record players vs. CDs.
Index rebuilds should happen extremely infrequently, and only to either correct a problem with an index, or change a setting. Here's the thing: 30% fragmentation isn't a real problem. It's just something DBAs do because they've been told to, and it's something they can measure.
So here's my question to you: how much time and resources does your server expend on index maintenance, and how much does that reduce time and resources consumed by queries?
If you can measure that, you can rebuild and reorg as much as you want.
Kinda sorta. Auto-update stats (even async) happens when 20% of the table data + 500 rows (assuming the table is > 500 rows) changes. If you have a table with a million rows in it, that's 200k rows. You can dynamically decrease that threshold by using Trace Flag 2371, but auto update stats uses the default sampling algorithm, which may not be enough for tables with heavily skewed data.
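The thresholds mentioned can be sketched numerically. The sqrt(1000 * rows) form is the commonly documented approximation of the Trace Flag 2371 dynamic threshold; treat it as an approximation, not the exact internal rule:

```python
import math

def static_threshold(rows: int) -> float:
    """Classic auto-update trigger: 20% of rows plus 500 (tables > 500 rows)."""
    return 0.20 * rows + 500

def dynamic_threshold(rows: int) -> float:
    """Approximate TF 2371 trigger: sqrt(1000 * rows) for larger tables."""
    return math.sqrt(1000 * rows)
```

For a million-row table the static rule needs about 200,500 changed rows before stats auto-update, while the dynamic rule fires after roughly 31,600.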
Well, yeah, I'm answering questions about a server I've never looked at. I prefer nightly stats updates, but I've seen servers that needed them more regularly than that.
So what should you do?
- Start by dialing back index maintenance to the commands I posted in the question you're referencing.
- If you run into problems, decrease fragmentation thresholds until they stop. If no one says anything, run index maintenance less frequently, until maybe, just maybe, you never run it at all.
- Update stats nightly.
Start with the default sampling threshold. If you find that the full-scan stats update that occurs with a rebuild is necessary, use the CommandLog table to find which tables and indexes were regularly being rebuilt, and start focusing on those. They'll typically be indexes on "big" tables, with tens of millions of rows. This is about as specific as I can get about a server I've never seen; you'll have to take it from here with experimentation.
See also my post, Why most of you should leave Auto-Update Statistics on.