The potential size of the buffer pool affects plan selection by the query optimizer in a number of ways. As far as I know, hyper-threading does not affect plan choice (though the number of potentially available schedulers certainly can).
Workspace Memory
For plans that contain memory-consuming iterators like sorts and hashes, the size of the buffer pool (among other things) determines the maximum amount of memory grant that might be available to the query at runtime.
In SQL Server 2012 (all versions) this number is reported on the root node of a query plan, in the Optimizer Hardware Dependencies section, shown as Estimated Available Memory Grant. Versions prior to 2012 do not report this number in showplan.
The estimated available memory grant is an input to the cost model used by the query optimizer. As a result, a plan alternative that requires a large sorting or hashing operation is more likely to be chosen on a machine with a large buffer pool setting than on a machine with a lower setting. For installations with a very large amount of memory, the cost model can go too far with this sort of thinking - choosing plans with very large sorts or hashes where an alternative strategy would be preferable (KB2413549 - Using large amounts of memory can result in an inefficient plan in SQL Server - TF2335).
Workspace memory grant is not a factor in your case, but it is something worth knowing about.
Data Access
The potential size of the buffer pool also affects the optimizer's cost model for data access. One of the assumptions made in the model is that every query starts with a cold cache - so the first access to a page is assumed to incur a physical I/O. The model does attempt to account for the chance that repeated access will come from cache, a factor that depends on the potential size of the buffer pool among other things.
The Clustered Index Scans in the query plans shown in the question are one example of repeated access; the scans are rewound (repeated, without a change of correlated parameter) for each iteration of the nested loops semi join. The outer input to the semi join estimates 28.7874 rows, and the query plan properties for these scans show estimated rewinds of 27.7874 as a result.
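The relationship between those two figures is simple arithmetic: the first execution of the inner side is counted as a rebind, and every remaining execution is a rewind. A quick sketch using the numbers from the plans in the question:

```python
# Estimated rows on the outer input of the nested loops semi join
estimated_outer_rows = 28.7874

# The first inner-side execution is counted as a rebind; each remaining
# execution (same correlated parameter values) is counted as a rewind
estimated_rebinds = 1.0
estimated_rewinds = estimated_outer_rows - estimated_rebinds  # ≈ 27.7874

print(estimated_rewinds)
```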
Again, in SQL Server 2012 only, the root iterator of the plan shows the number of Estimated Pages Cached in the Optimizer Hardware Dependencies section. This number reports one of the inputs to the costing algorithm that looks to account for the chance of repeated page access coming from cache.
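Both Optimizer Hardware Dependencies values appear as attributes of the OptimizerHardwareDependentProperties element in the showplan XML, so you can read them from a saved .sqlplan file with any XML library. A minimal sketch using Python's standard library; the attribute values in the fragment below are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Minimal showplan fragment for illustration. A real .sqlplan file saved
# from SSMS (SQL Server 2012+) carries the same element under <QueryPlan>;
# the numeric values here are invented.
SHOWPLAN_NS = "http://schemas.microsoft.com/sqlserver/2004/07/showplan"
plan_xml = f"""
<QueryPlan xmlns="{SHOWPLAN_NS}">
  <OptimizerHardwareDependentProperties
      EstimatedAvailableMemoryGrant="839952"
      EstimatedPagesCached="209988"
      EstimatedAvailableDegreeOfParallelism="4" />
</QueryPlan>
"""

root = ET.fromstring(plan_xml)
props = root.find(f"{{{SHOWPLAN_NS}}}OptimizerHardwareDependentProperties")

print("Estimated Available Memory Grant:", props.get("EstimatedAvailableMemoryGrant"))
print("Estimated Pages Cached:", props.get("EstimatedPagesCached"))
```

Comparing these two attributes across your two instances is a quick way to confirm that the optimizer really did see different hardware-dependent inputs when compiling the same query.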
The effect is that an installation with a larger configured maximum buffer pool size reduces the estimated cost of scans (or seeks) that read the same pages more than once by a greater amount than an installation with a smaller maximum buffer pool size does.
In simple plans, the cost reduction on a rewound scan can be seen by comparing (estimated number of executions) * (estimated CPU + estimated I/O) with the estimated operator cost, which will be lower. The calculation is more complex in the example plans due to the effect of the semi join and union.
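The shape of that comparison can be sketched with made-up numbers. The per-execution costs and the discount factor below are purely illustrative assumptions; the real scaling formula the optimizer applies for cached pages is not documented:

```python
# Hypothetical per-execution costs for a rewound Clustered Index Scan
# (illustrative values only, not taken from the plans in the question)
estimated_executions = 28.7874
estimated_cpu = 0.0001581   # per-execution CPU cost
estimated_io = 0.003125     # per-execution I/O cost

# Naive costing: charge the full CPU + I/O cost for every execution
naive_cost = estimated_executions * (estimated_cpu + estimated_io)

# The optimizer discounts repeated page access it expects to be served
# from cache, so the estimated operator cost shown in the plan is lower
# than the naive figure. This discount factor is an invented placeholder.
cache_discount = 0.40
estimated_operator_cost = naive_cost * cache_discount

print(naive_cost, estimated_operator_cost)
```

The larger the expected buffer pool, the stronger this discount, which is exactly why the two instances in the question ended up on opposite sides of the scan-versus-spool tipping point.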
Nevertheless, the plans in the question appear to show a case where the choice between repeating the scans and creating a temporary index is quite finely balanced. On the machine with a larger buffer pool, repeating the scans is costed slightly lower than creating the index. On the machine with a smaller buffer pool, the scan cost is reduced by a smaller amount, meaning the index spool plan looks slightly cheaper to the optimizer.
Plan Choices
The optimizer's cost model makes a number of assumptions, and contains a great number of detailed calculations. It is not always (or even usually) possible to follow all the details because not all the numbers we would need are exposed, and the algorithms can change between releases. In particular, the scaling formula applied to take account of the chance of encountering a cached page is not well known.
More to the point in this particular case, the optimizer's plan choices are based on incorrect numbers anyway. The estimated number of rows from the Clustered Index Seek is 28.7874, whereas 256 rows are encountered at runtime - almost an order of magnitude out. We cannot directly see the information the optimizer has about the expected distribution of values within those 28.7874 rows, but it is very likely to be horribly wrong as well.
When estimates are this wrong, plan selection and runtime performance are essentially no better than chance. The plan with the index spool happens to perform better than repeating the scan, but it is quite wrong to think that increasing the size of the buffer pool was the cause of the anomaly.
Where the optimizer has correct information, the chances are much better that it will produce a decent execution plan. An instance with more memory will generally perform better on a workload than another instance with less memory, but there are no guarantees, especially when plan selection is based on incorrect data.
Both instances suggested a missing index in their own way. One reported an explicit missing index, and the other used an index spool with the same characteristics. If the index provides good performance and plan stability, that might be enough. My inclination would be to rewrite the query as well, but that's probably another story.
The questions of whether/how to directly measure CPU core usage etc. are beyond my understanding, but here's what I'd consider trying:
Run a standard profiler trace with database name added, during your normally busiest period. Total up the CPU column for the SQL:BatchCompleted and RPC:Completed events by database, and you'll get a rough idea of how much CPU resources (which may be spread across multiple cores) each database is consuming. (Perhaps also total up the CPU column for the other events to see if anything major was missed. And save the trace "as trace table" for analysis.)
Exactly how to translate that to how many cores you'll need, I can't say. But if you also measure the total system CPU usage during the profiler run you might be able to estimate against the specific database's ratio of the total.
Note: If your server takes less than a few hundred batch requests a second (see SSMS activity monitor), then a standard profiler trace even across the network will almost certainly not affect performance. And if you instead script a server-side trace then more requests a second can be handled without slowing anything, but I make no promises for your environment.
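Once the trace is saved as a table and exported, totaling the CPU column per database is a simple aggregation. A sketch in plain Python, assuming the standard trace column names (EventClass, DatabaseName, CPU) and invented sample rows:

```python
from collections import defaultdict

# Rows as they might look after saving the trace "as trace table" and
# exporting. Column names follow the standard trace columns; the data
# itself is invented for illustration.
trace_rows = [
    {"EventClass": "SQL:BatchCompleted", "DatabaseName": "Sales", "CPU": 120},
    {"EventClass": "RPC:Completed",      "DatabaseName": "Sales", "CPU": 45},
    {"EventClass": "SQL:BatchCompleted", "DatabaseName": "HR",    "CPU": 30},
    {"EventClass": "Audit Logout",       "DatabaseName": "HR",    "CPU": 500},
]

# Only total the completed-batch and completed-RPC events, as described
# above; other event classes are checked separately for anything major.
INTERESTING = {"SQL:BatchCompleted", "RPC:Completed"}

cpu_by_database = defaultdict(int)
for row in trace_rows:
    if row["EventClass"] in INTERESTING:
        cpu_by_database[row["DatabaseName"]] += row["CPU"]

print(dict(cpu_by_database))  # {'Sales': 165, 'HR': 30}
```

The same aggregation can of course be done with a GROUP BY query directly against the saved trace table.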
For RAM, I wonder if http://www.sqlshack.com/sql-server-memory-performance-metrics-part-4-buffer-cache-hit-ratio-page-life-expectancy/ might help you determine whether your instances need less or more memory. I don't think there's any way to do this per database, though.
Best Answer
The Process: % Processor Time (Instance - sqlservr.exe) counter takes into account all of your cores. If you divide by the number of cores you get an answer much closer to what you expected. This TechNet article explains it in far more detail, but the counter is calculated against a baseline of (number of logical cores * 100), so the reported value can exceed 100.
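The normalization is a one-line division. A minimal sketch, assuming a hypothetical reading of 160 on an 8-core machine:

```python
def per_core_processor_time(raw_counter_value, logical_cores):
    """Normalize Process: % Processor Time, which is reported against a
    baseline of (logical cores * 100), back to a familiar 0-100 scale."""
    return raw_counter_value / logical_cores

# Example: the counter reports 160 on an 8-core machine,
# i.e. the process is using 20% of total CPU capacity
print(per_core_processor_time(160, 8))  # 20.0
```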