SQL Server – Fix 100% CPU Usage on NUMA Node 0

numasql servervmwarewindows-server

The previous Senior DBA left the company and I was made aware of a server which suffers from multiple problems, largely slowness (all the way from SSMS taking a long time to open, long running queries, and failing SSIS jobs (Data Warehouse), to difficulties connecting to linked servers (150+ of them)).

There is most likely way too much happening on this server which contains 5 stacked instances. A new Senior DBA will be joining the team soon but it would be nice to have this all straightened out for him or her by the time they join.

So, the nitty-gritty:

When I was made aware of the issue I found that 94% of the server's memory was allocated to SQL Server. I went ahead and brought that down to 85% by deallocating memory from two instances which were overprovisioned.

I then noticed that MAXDOP on our default instance was set to 4 (possibly 6, I don't recall) along with a CPU Affinity setting. Those CPUs were pegged while there was minimal activity on the others. I went ahead and removed the Affinity setting (since these settings were in place before additional CPUs were added). I set MAXDOP to 20 across all 5 instances.

Currently, I'm still seeing 4 of the CPUs pegged but with an overall avg. usage (across all CPUs) of roughly only 25%.

I've used SysInternal's ProcExp, the Resource Monitor, and the Windows Performance Toolkit to observe the issue but don't really know how to isolate what process(es), specifically, is(are) the root cause. Any recommendations / guidance on how to truly isolate what's happening here? (i.e. specific counters / traces / other programs.)

UPDATE, per requests:

System Info:
Windows Server 2012 R2 Standard
64 GB memory total
20 CPUs

Configuration:
24 GB memory allocated to this instance
26.5 GB memory allocated to other instances (total of 50.5 GB – 78.9%)
Cost Threshold for Parallelism = 50 (across all instances)

I disabled an unused SSAS process.

Best Answer

So you've got 5 'stacked' instances on a single Windows server. You haven't said exactly how many sockets/CPUs are available and how much memory though. I like setting affinity for each instance in such cases, even if I decide to have CPUs overlapping in the struggle to balance the overall CPU load (depends on each instance's load).

Any instance with more than 4 CPUs could use an explicit DOP setting in my experience - rarely over '4' in stacked cases like yours. Don't forget to set 'Cost threshold of Parallelism' for each instance to something reasonable (50?) to avoid excessive parallelism - in your case this is even more important.

Remember that the memory left "for the OS" should be more now, since you have to account for the footprint of each instance (on top of SSIS etc). Check in SQL Config Mgr if SSAS is also running and adjust its 'max memory' accordingly, by default it goes for 80% of the whole server memory (!)

Also maybe worth taking away and 'Lock pages in memory' rights of the SQL Service account(s) so that the OS can breath and do its job better (if it pages, everyone suffers!). Also good practice is to set some reasonable 'min memory' for each instance.

I think running sp_blitz and sp_blitz_first on each instance would give you some quick pointers on more pressing issues.

You may also want to monitor some windows permon counters like 'available memory' and 'working set' for each of the processes running there in case you find particular times of the day/night when the server is suffering.