SQL Query Optimizer – Resource Availability and Cost Variation

sql-server-2008-r2

When the query optimizer calculates the cost of an execution plan, can you expect the chosen plan to differ depending on the amount of resources available to the server?

Take the following scenario:

  • You have a VMware virtual machine running SQL Server 2008 R2 with 32 GB of RAM and 8 CPU cores, with storage on a SAN. Seven VMs share the host box.
  • You find that CPU usage regularly peaks at 130%, so you move most of the VMs off the host, leaving only 3 of the original 7, including the SQL Server box.
  • You immediately find that actual performance is worse, even though more CPU headroom is now available. CPU utilization on the guest climbs far above the 30% it could muster before (because more CPU resource is now available for it to use), yet users of the application see a marked slowdown.

My interpretation of why a SQL Server VM performs worse when more CPU resource becomes available to it is that the procedure cache still contains plans that were compiled and optimized for an environment with less CPU availability. Once the extra resource is available, the procedure cache keeps serving those now sub-optimal plans, which may perform less well.

Does this make sense? We had this exact scenario a couple of weeks back and had lots of complaints about performance as soon as we stopped the CPU thrashing on the host. I ran DBCC FREEPROCCACHE and performance seemed to improve afterwards; either that, or the users got used to it.
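
For reference, the cache flush itself is just the single DBCC statement below; the DMV query in front of it is only a sketch of how you could inspect which plans are cached and how heavily they are reused before deciding to flush (the TOP 20 is arbitrary):

    -- Sketch: what is cached right now and how often each plan is reused
    SELECT TOP (20)
           cp.usecounts,
           cp.objtype,
           cp.size_in_bytes,
           st.text
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    ORDER BY cp.usecounts DESC;

    -- What I actually ran: flush the entire plan cache so every query
    -- gets recompiled under the new conditions
    DBCC FREEPROCCACHE;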

Thoughts?
Thanks

Best Answer

First, some background information

So far (up to SQL Server 2014) the query optimiser only takes a few hardware factors into account:

  • The number of CPU cores available on the system
  • The amount of memory currently available

CPU cores: The core count assigned to SQL Server via the affinity mask determines how efficient a parallel plan can possibly be. For example, on a system with 4 cores you can expect roughly a 4x speedup when switching to a parallel plan, whereas a system with 2 cores only gets a doubling. Because of this, the core count can affect the choice of plan. Unfortunately, not every query scales linearly with the core count, yet the optimiser assumes it does. This means that on machines with more than around 12-16 cores, asking for more parallelism can actually SLOW the query down. Only the number of cores is taken into account, not the speed of the CPUs.
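
As an illustration of the knob this translates to in practice, parallelism can be capped server-wide or per statement; the values 8 and 4 below are placeholders, not recommendations:

    -- Cap the degree of parallelism server-wide (value is illustrative)
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 8;
    RECONFIGURE;

    -- Or cap a single statement with a query hint
    SELECT COUNT(*)
    FROM dbo.SomeLargeTable          -- hypothetical table name
    OPTION (MAXDOP 4);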

Memory Available: When the plan is compiled, the amount of memory available is taken into account. This drives strategies such as hash joins, sort space and other memory-intensive operations. Mis-estimation here is very dangerous and can lead to poor performance, especially if you overestimate the memory available for a hash join and the operation has to spill to tempdb.
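
If you suspect memory mis-estimation, the workspace memory grant DMV is the first place to look; a minimal sketch (the ordering is just one sensible choice):

    -- Queries currently waiting for, or holding, a workspace memory grant.
    -- A large gap between requested and ideal memory, or non-zero wait
    -- times, points at pressure on sorts and hash joins.
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           ideal_memory_kb,
           query_cost,
           wait_time_ms
    FROM sys.dm_exec_query_memory_grants
    ORDER BY requested_memory_kb DESC;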

Your specific case

Without measurements from your system it is difficult, if not impossible, to know exactly what happened in your scenario; too many variables potentially changed at the same time. It may simply be that something else changed in the environment. Any diagnosis without data is pure guesswork, and real DBA work is a science, not an exercise in artful guessing.

You would need to collect at least the following to diagnose it properly (a minimal starting point for the wait stats part is sketched after the list):

  • The query plans before/after
  • Wait stats
  • Exact machine configuration (both for the host and the VM in the virtualised scenario)
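
As a sketch of the wait stats part, snapshot sys.dm_os_wait_stats before and after the change and compare the deltas rather than the absolute numbers; the excluded wait types and the TOP 15 below are only illustrative:

    -- Top waits by accumulated wait time; take one snapshot before and
    -- one after the configuration change and diff them
    SELECT TOP (15)
           wait_type,
           wait_time_ms,
           signal_wait_time_ms,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                            'SQLTRACE_BUFFER_FLUSH', 'BROKER_TASK_STOP')
    ORDER BY wait_time_ms DESC;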