Sql-server – SQL Server recreating plans each day

execution-plansql serversql-server-2008statistics

We have this problem in our production environment.

Microsoft SQL Server 2008 R2 (SP1) – 10.50.2500.0 (X64) – Enterprise Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1).

SQL Server is dropping all (almost 100% of) the old execution plans and recreating them everyday overnight (from 11:00 PM to 8:00 AM). This was even happening when the 'auto update stats' was at disabled state. We've turned the 'auto update stats' on for the last 2-3 weeks. But it's still happening.

We don’t really know what triggers this re-generation of plans but we are sure we don’t do it manually.

The only thing that really coincides with the timing of the plans being regenerated is a DB maintenance job that we have: the daily index reorganizing (when fragmentation is 5-30%), and the daily index rebuilding (when fragmentation is more than 30%) job. Usually this daily maintenance job only does reorganizing (as the index fragmentation is never more than 30% on daily basis).

Impact:

These newly created plans make some UDF calls/query calls (that are called from UI/web pages) take way longer (minutes as opposed to less than 1 second), and so the sessions just get piled up taking the CPU close to 90%.

The problem goes away the moment those stuck sessions are forcefully deleted (on the DB side), and 1) when all the corresponding execution plans are cleared manually (for queries) or 2) when the UDFs are altered (for functions). Any new plans created by SQL server from that moment work just perfect throughout the day until it ends up having the same issue next morning. Also, this behavior is not 100% consistent, we're not really seeing it each and every morning. But there have been periods of time where we've been seeing it consistently for 4-5 days in a row.

The problem happens on business mornings, that's when UI/web pages are accessed more intensively, it seems.

Does anyone have a clue what is causing this and how to solve this problem? Any help would be much appreciated.

Best Answer

Well I have some ideas which could cause this behavior.

  1. Do you monitor your memory pressure? Maybe your queries raise a certain limit which will cause the flush of the plan cache. I don't know your application, but does this correspondent with your logs from your frontend servers? Is there pressure too during this time?
  2. Do you have a dedicated SQL Server or does the server share it's hardware with other processes/services? If not, try to consider to outsource your SQL Server to a dedicated machine instead. This will reduce side effects from other services.
  3. You may want to use optimize for ad hoc workloads, which will just save a plan stub and compile it if it's needed. This will reduce the load of your plancache, which will lower the chance of an plancache flushing. You can enable it using sp_configure 'optimize for ad hoc workloads',1; reconfigure. This can be done if you have enabled the advanced options using sp_configure 'show advanced options',1; reconfigure.
  4. Another idea can be backups. Just simple backups. If they are aggressively, it may occur that your machine gets under pressure too. The time your mentioning just sounds like a good timespan for planning a backup.
  5. Maybe it's quite simple a bug in your maintenance script. Have you checked if there is a logical issue which causes your script to rebuild all indices instead of just those who matches the criteria. This maybe can cause it too.

Just beside all of this possibilities, it may be useful to check the log files for some changes to the options affinity mask, affinity I/O mask and their x64 partners. Another thing can be a change of the MAXDOP option of your instance. Please check the logs for them too. They will need to flush the plancache too.

Last but not least, you can still run a serverside trace (just setting it up using the profiler, start it, stop it and using sql command to start it again on the serverside). Beside that perfmon is your friend. It can watch and monitor your performance values for a time. Maybe you can see parallels in the pressure with certain actions on your server which may cause those flush.

Hopefully this will help you, even if the answer come a bit later.