Sql-server – High CPU and plan cache instability

cpumemoryplan-cachesql server

In the past couple of months my SQL Server instance has been experiencing higher than usual CPU usage than before, at times hitting 100% and affecting an application.

There have been no recent changes to any SQL Server configurations or options in that time, the OS may have in terms of Patching but that is another teams area, and it was not operating like this before the past couple of months.When we have high sustained period of CPU use, I see mostly wait stats of CXPACKET, LCK_M_X, THREADPOOL, LATCH_EX, RESOURCE_SEMEPHOR_QUERY_COMPILE.  CXPACKET for parallelism being by far the largest.

I like to run Brent Ozar’s sp_blitzcache and the one thing that stands out is that the plan cache seems to be cleared by something and 99% of the plans are created in the last hour.  This says this could be memory pressure or plan cache instability.I’m not sure what to check of where to look. 

I have thought of enabling LPIM which isn’t now but besides this not sure what to try and I would be grateful for any help.

Edit**
I had started to log sp_whoisactive into a table. I found that there was a long running process of the application that was running a very intensive denormalisation process on the database, that gradually built and built until it finally starte a stage when it rebuild indexes.

Obviously this is what killed server performance.  We have made sure this does not run again.

However the business has become very twitchy about the appplication/server.  On friday there was a period when again the CPU maxed out to 100% for about 10 mins, unlike before when it lasted longer, this time it stopped after the 15mins. From looking at the results of sp_whoisactive I can see before the CPU got maxed I started to see a high number of RESOURCE_SEMAPHORE_QUERY_COMPILE wait types, these carried on for about 10 mins. I kind of understand what this wait type is, but did this cause the high CPU for the period, as some kind of cumulative effect of threads waiting?

If it was, how can I fix this so that it won't happen again?
I don't have any scope to tune queries/IXs, would adding more memory help?

*****Edit 2 *****
Update on thisIt seems that the inital issue was another team who started to run a Denormalisation process on the database that was really intensive and just killed the server.
After they were stopped doing this I  recieved very few RESOURCE_SEMAPHORE_QUERY_COMPILE wait types unlike before

The situation was much better than before, however CPU was still hitting 100% at times for longer than I wanted and I was still experiencing plan cache instability, it was continually being trimmed. 

After coming across this webpage – https://social.msdn.microsoft.com/Forums/sqlserver/en-US/6b9118f3-b6c1-4f39-ae80-d149392656f9/sql-server-2016-memory-pressure-leads-to-the-plan-cache-clearing?forum=sqldatabaseengineI realised that something at the os level had changed and must have been needing more memory, causing memory pressure and then it taking memory resource away from sql server and the plan cache.

I lowered my instance max server memory setting and things have settled down much more, CPU overall is significantly lower and rarely hitting 100% and the plan cache is a nice size and never being significantly trimmed down.

Best Answer

This DBA.StackExchange answer gives a good place to start tracing the issue.

Specifically the linked Microsoft Docs - Plan Caching in SQL Server 2008 docs (albeit a little dated, but still useful) have a section called Plan stability-related reasons of batch recompilations that details all the reasons the plan cache can be cleared out (you should also read the previous section Causes of Recompilations for some useful info).

There are too many reasons to quote from the above article but the one section I find particularly interesting and might be the easiest place to start is the following:

The following operations flush the entire plan cache, and therefore, cause fresh compilations of batches that are submitted the first time afterwards: Detaching a database, Upgrading a database to SQL Server 2005, Upgrading a database to SQL Server 2008, Restoring a database, DBCC FREEPROCCACHE command, RECONFIGURE command, ALTER DATABASE ... MODIFY FILEGROUP command, Modifying a collation, using ALTER DATABASE … COLLATE command

In my experience, I've personally seen developers schedule commands such as DBCC FREEPROCCACHE to run routinely and cause similar issues. (Additionally I've seen overuse of the OPTION (RECOMPILE) and WITH RECOMPILE clauses all throughout a database causing similar issues as well.)

If you can eliminate the above checklist then you should follow the rest of the guide in the Plan stability-related reasons of batch recompilations docs, maybe looking next at how often you guys are running DDL queries that would cause plan cache recompilation, and then how often are you guys updating statistics across the board.

It's unfortunately going to be an elimination game most likely and won't be a trivial issue to solve, but best of luck.