Virtualized SQL Server – 100% CPU and 0 Active Memory Issue

cpu, maxdop, memory, sql-server-2016

We recently experienced an odd crash with one of our servers. We know what caused it, but we don't know why SQL Server responded the way it did.

SQL Server 2016

Here is what happened:

Friday: Changed MAXDOP from 4 to 2 without checking how much of an impact this would have on compiled queries.
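(For context, an instance-level MAXDOP change is usually made with sp_configure; this is a sketch of that typical approach, not necessarily the exact script we ran:)

    -- Instance-level MAXDOP change (typical sp_configure approach; illustrative only)
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Lower max degree of parallelism from 4 to 2
    EXEC sp_configure 'max degree of parallelism', 2;
    RECONFIGURE;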

Saturday Morning: The server progressively uses more CPU starting at 2 AM until it hits 100% at 5 AM, and active RAM immediately drops from 44 GB to 500 MB (what?!?!). The server is rebooted, and all seems well.

Monday Morning: At 1:05 AM, when one of the nightly batch jobs starts, the server hits 100% CPU and drops active memory to 500 MB by 1:15 AM.

There are a significant number of large queries that run every morning on this server as part of our nightly batch processing. Looking at the Query Store, I can see they were all compiled for 4 cores before the MAXDOP change on Friday.
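(This is roughly the kind of Query Store check I ran to confirm it; the DOP columns come from sys.query_store_runtime_stats, and the exact filtering here is only illustrative:)

    -- Rough sketch: what DOP each stored plan has been running at, slowest first
    SELECT
        q.query_id,
        p.plan_id,
        rs.last_execution_time,
        rs.avg_dop,
        rs.max_dop,
        rs.avg_duration / 1000.0 AS avg_duration_ms  -- avg_duration is in microseconds
    FROM sys.query_store_query AS q
    JOIN sys.query_store_plan AS p
        ON p.query_id = q.query_id
    JOIN sys.query_store_runtime_stats AS rs
        ON rs.plan_id = p.plan_id
    ORDER BY rs.avg_duration DESC;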

So all the large queries that used to run on 4 cores had to be recompiled for 2, which made all of them much slower during one of the busiest time frames for this server.
I understand how this would significantly affect performance, but I don't understand why the server became completely unresponsive and had so little active memory.

Does anyone have some experience with this kind of behaviour?

Best Answer

Check out this interesting post from David Klee: VM Memory Counters Lie for SQL Server VMs

In particular, this quote is very relevant to your situation:

SQL Server uses memory as a working set I/O buffer, and memory blocks in the SQL Server buffer pool are most commonly used as a read cache. Memory blocks that are repeatedly read only show up in the active counters if that block has been read from or written to in the last sample interval, which is usually 20 seconds. Therefore, SQL Servers that have an appropriate (or too high, but that’s a different story) amount of memory assigned to the VM will have most of the commonly accessed data in memory, but those blocks are rarely accessed in their entirety over that short of a sample window. The VM active memory counters will show a very low active value, but this SQL Server could actually be very busy.

This explains your first issue (related to "active" memory). In essence, the metric you were looking at isn't reliable, so it doesn't tell us much of substance about the situation.


The other question, related to unresponsiveness, is harder to pin down without seeing wait stats from the time of the outage, or specifics of the recompiled execution plans.

If I were going to speculate, a plausible theory might be that:

  • queries now running at MAXDOP 2 (instead of 4) take significantly longer to complete
  • queries are arriving at the server at the same rate as before
  • so now there are more queries running at once (since they take longer to complete)
  • the server starts to experience thread starvation (threadpool waits) because there are so many concurrent queries running
  • if the queries require memory grants, they could also start queuing up for memory (resource_semaphore waits)

All of these things would combine to make your server seem unresponsive. They might also cause the "active memory" counter in the VM to drop, since memory pages would be accessed less often (with everything waiting on CPU resources).
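If you want evidence for (or against) this theory the next time the batch window runs, checking the wait statistics for THREADPOOL and RESOURCE_SEMAPHORE is a reasonable starting point. A sketch (these are cumulative counters since the last restart, so capture a baseline and compare against a snapshot taken during the problem window):

    -- Quick check for thread starvation and memory-grant queuing
    SELECT
        wait_type,
        waiting_tasks_count,
        wait_time_ms,
        max_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type IN (N'THREADPOOL', N'RESOURCE_SEMAPHORE')
    ORDER BY wait_time_ms DESC;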