SQL Server 2008 R2 – Does VMWare VMotion Affect Performance?

sql server, sql-server-2008-r2, vmware

We've been experiencing strange issues for a while now in our virtual environment running SQL Server.

We randomly get calls from users about very poor performance on the SQL boxes. Sure enough, when I look, the CPUs are pegged at 100%. I then VMotion the VM to another host, and as soon as the move finishes, performance returns to normal.

I've been working with the VMWare admins and they have assured me that VMotion would not affect anything on the SQL Servers. It's almost as if the move to another host is causing an execution plan change or the like. I don't understand, however, why out of nowhere CPU usage jumps through the roof unless it's a bad query plan recompile due to parameter sniffing, but I would think that a VMotion would not fix that since it's supposed to be transparent.

The VM farm consists of 19 Dell servers (sorry, I don't know the exact model), each with 2 physical sockets and 12 cores per socket.

Has anyone else observed this behavior before? I'm wondering whether it has something to do with capacity, since the hosts have some large VMs to handle (there are 14 VMs with 80GB and 12 cores each floating around). Even with those VMs on the farm, I can see in the vSphere console that the hosts aren't being over-utilized (memory does creep up to the 80% mark a lot of the time, but there is no ballooning).

Also, this occurs on all different versions of SQL (2008, 2008R2, 2012, and 2014).

Thanks a lot for any insight!

Best Answer

VMWare VMotion isn't going to reboot your server, restart any services or drop caches. The VM stays live during the VMotion, so you shouldn't lose cache or plans, unless the host you are moving to is under severe memory pressure and ballooning is active.

What does happen during VMotion is increased network latency and maybe a dropped ping while migrating, but that effect is completely gone once the migration is over and should not affect CPU usage inside the guest.

However, what you need to understand is that the %CPU reported inside the guest is a percentage of the pool of resources the hypervisor has allocated to you, not of the underlying physical CPU. So if you move from a host allotting you 4 GHz to a host allotting you 2 GHz, the CPU usage shown inside the guest would double for the same workload.
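To make the arithmetic concrete, here is a minimal sketch of that relationship. The function name and the 1000 MHz workload figure are made up for illustration, and it assumes the guest simply reports demand divided by whatever the hypervisor is currently allotting, capped at 100%:

```python
def guest_cpu_percent(demand_mhz: float, allotted_mhz: float) -> float:
    """Approximate %CPU seen inside the guest for a fixed workload.

    Assumption (illustrative only): the guest reports demand as a
    fraction of the cycles the hypervisor currently allots to it.
    """
    if allotted_mhz <= 0:
        raise ValueError("allotted_mhz must be positive")
    return min(100.0, demand_mhz / allotted_mhz * 100.0)


# Same hypothetical 1000 MHz workload on two hosts.
on_roomy_host = guest_cpu_percent(1000, 4000)  # host allotting 4 GHz
on_busy_host = guest_cpu_percent(1000, 2000)   # host allotting 2 GHz
print(on_roomy_host, on_busy_host)
```

The workload has not changed between the two calls; only the denominator has, which is why a move to a less contended host can make the guest's CPU figure drop immediately.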

There are a few performance counters you can monitor inside the guest VM to see the actual CPU time you are getting from the hypervisor, such as:

  • % Processor Time
  • Effective VM Speed in MHz
  • Host processor speed in MHz
  • Limit in MHz

See here for a start

These counters could give you an idea of the actual MHz you are getting, any limits imposed by the VMWare configuration, etc.

If you have determined that the hypervisor isn't allocating enough cycles to your VM, you could set a reservation to guarantee a certain amount of MHz, or raise the VM's CPU shares (its scheduling weight) so that it takes precedence over other VMs.

If you have access to esxtop (not the flattened, sampled-average charts from vCenter), you should keep an eye on %RDY (indicating your VM has threads waiting for a physical CPU) and %CSTP (indicating co-scheduling issues). For more information, read through this yellowbrick post.

Since you say the hosts are also running other high-load VMs, you need to consider that, when configured with defaults, VMWare tries to allocate resources to the most demanding VMs. A sudden increase in load in another VM could have dramatic (temporary) effects on your VM's CPU allocation.

As for cache flushes, I don't see how you could get them unless there is severe memory pressure and the new host reclaims a lot of memory through the balloon driver or dynamic memory settings. Otherwise the machine stays live and memory is copied over in lockstep.
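If you only have the vCenter charts, the "CPU Ready" value there is a summation in milliseconds per sampling interval rather than a percentage like esxtop's %RDY. A rough conversion can be sketched as follows; the function name is made up, the 20-second default assumes the real-time chart interval, and dividing by the vCPU count assumes you are reading the VM-level aggregate rather than a single vCPU:

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into an approximate per-vCPU percentage.

    Assumptions (not from the original answer): interval_s is the chart's
    sampling interval (20 s for real-time charts) and ready_ms is the
    total for the whole VM across all its vCPUs.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms / (interval_s * 1000.0) / vcpus * 100.0


# Hypothetical reading: 2000 ms of ready time in one 20 s sample.
print(ready_percent(2000))           # single vCPU
print(ready_percent(2000, vcpus=4))  # same total spread over 4 vCPUs
```

Treat the result as a way to compare a VM before and after a move rather than as an exact match for what esxtop would have shown at that moment.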