We recently experienced an odd crash with one of our server. We know what caused it, but we don't know why SQL Server responded the way it did.
SQL Server 2016
Here is what happened:
Friday: Changed MAXDOP from 4 to 2 without checking how much of an impact this would have on compiled queries.
Saturday Morning: Server progressively uses more CPU starting at 2 AM until it hits 100% at 5 AM and Active RAM immediately drops to 500MB from 44GB (what?!?!). Server is rebooted, all seems well.
Monday Morning: at 1:05 AM when one of the nightly batch job starts, the server hits 100% CPU and drops active memory to 500 MB by 1:15 AM.
There are a significant number of large queries that run every morning on this server as part of our nightly batch processing. Looking at the Query Store I can see they were all compiled for 4 cores before the change to MDOP on Friday.
So, all the large queries that used to have 4 cores had to be recompiled for 2, which made all of them much slower during one of the busiest time frames for this server.
I understand how this would significantly affect performance on the server, but I don't understand why the server became completely unresponsive, and had so little active memory.
Does anyone have some experience with this kind of behaviour?
Best Answer
Check out this interesting post from David Klee: VM Memory Counters Lie for SQL Server VMs
In particular, this quote is very relevant to your situation:
This explains your first issue (related to "active" memory). In essence, the metric you were looking at isn't reliable, so it doesn't tell us much of substance about the situation.
The other question, related to unresponsiveness, is harder to pin down without seeing wait stats from the time of the outage, or specifics of the recompiled execution plans.
If I were going to speculate, a plausible theory might be that:
All of these things would combine to make your server seem unresponsive. It might also cause the "active memory" in the VM to drop, since memory pages might be accessed less (due to everything waiting on CPU resources).