Virtualized SQL Server – 100% CPU and 0 Active Memory Issue

cpu, maxdop, memory, sql-server-2016

We recently experienced an odd crash with one of our servers. We know what caused it, but we don't know why SQL Server responded the way it did.

SQL Server 2016

Here is what happened:

Friday: Changed MAXDOP from 4 to 2 without checking how much of an impact this would have on compiled queries.
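(For context, an instance-level MAXDOP change is usually made with sp_configure; this is a sketch of that typical approach, not necessarily the exact script we ran:)

    -- Instance-level MAXDOP change (typical sp_configure approach; illustrative only)
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Lower max degree of parallelism from 4 to 2
    EXEC sp_configure 'max degree of parallelism', 2;
    RECONFIGURE;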

Saturday Morning: The server progressively uses more CPU starting at 2 AM until it hits 100% at 5 AM, and active RAM immediately drops from 44 GB to 500 MB (what?!?!). The server is rebooted, and all seems well.

Monday Morning: At 1:05 AM, when one of the nightly batch jobs starts, the server hits 100% CPU and drops active memory to 500 MB by 1:15 AM.

There are a significant number of large queries that run every morning on this server as part of our nightly batch processing. Looking at the Query Store, I can see they were all compiled for 4 cores before the MAXDOP change on Friday.
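(This is roughly the kind of Query Store check I ran to confirm it; the DOP columns come from sys.query_store_runtime_stats, and the exact filtering here is only illustrative:)

    -- Rough sketch: what DOP each stored plan has been running at, slowest first
    SELECT
        q.query_id,
        p.plan_id,
        rs.last_execution_time,
        rs.avg_dop,
        rs.max_dop,
        rs.avg_duration / 1000.0 AS avg_duration_ms  -- avg_duration is in microseconds
    FROM sys.query_store_query AS q
    JOIN sys.query_store_plan AS p
        ON p.query_id = q.query_id
    JOIN sys.query_store_runtime_stats AS rs
        ON rs.plan_id = p.plan_id
    ORDER BY rs.avg_duration DESC;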

So all the large queries that used to run on 4 cores had to be recompiled for 2, which made all of them much slower during one of the busiest time frames for this server.
I understand how this would significantly affect performance, but I don't understand why the server became completely unresponsive and had so little active memory.

Does anyone have some experience with this kind of behaviour?

Best Answer

Check out this interesting post from David Klee: VM Memory Counters Lie for SQL Server VMs

In particular, this quote is very relevant to your situation:

SQL Server uses memory as a working set I/O buffer, and memory blocks in the SQL Server buffer pool are most commonly used as a read cache. Memory blocks that are repeatedly read only show up in the active counters if that block has been read from or written to in the last sample interval, which is usually 20 seconds. Therefore, SQL Servers that have an appropriate (or too high, but that’s a different story) amount of memory assigned to the VM will have most of the commonly accessed data in memory, but those blocks are rarely accessed in their entirety over that short of a sample window. The VM active memory counters will show a very low active value, but this SQL Server could actually be very busy.

This explains your first issue (related to "active" memory). In essence, the metric you were looking at isn't reliable, so it doesn't tell us much of substance about the situation.


The other question, related to unresponsiveness, is harder to pin down without seeing wait stats from the time of the outage, or specifics of the recompiled execution plans.

If I were going to speculate, a plausible theory might be that:

  • queries now running at MAXDOP 2 (instead of 4) take significantly longer to complete
  • queries are arriving at the server at the same rate as before
  • so now there are more queries running at once (since they take longer to complete)
  • the server starts to experience thread starvation (threadpool waits) because there are so many concurrent queries running
  • if the queries require memory grants, they could also start queuing up for memory (resource_semaphore waits)

All of these things would combine to make your server seem unresponsive. They might also cause the "active memory" counter in the VM to drop, since memory pages would be accessed less often (with everything waiting on CPU resources).
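If you want evidence for (or against) this theory the next time the batch window runs, checking the wait statistics for THREADPOOL and RESOURCE_SEMAPHORE is a reasonable starting point. A sketch (these are cumulative counters since the last restart, so capture a baseline and compare against a snapshot taken during the problem window):

    -- Quick check for thread starvation and memory-grant queuing
    SELECT
        wait_type,
        waiting_tasks_count,
        wait_time_ms,
        max_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type IN (N'THREADPOOL', N'RESOURCE_SEMAPHORE')
    ORDER BY wait_time_ms DESC;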