Sql-server – Is it normal to share CPUs on a VM of SQL Server

parallelismsql serversql-server-2016virtualisationvmware

Our IT set up a SQL Server as a VM on a big VMWare box that contains other VMs.
The CPUs are setup as shared. As a result, any queries that might need multiple CPUs take 30x longer than if I limited it to a single CPU. Example:

SELECT TOP 2000 lwa.Message INTO #foo
FROM dbo.LogWidgetsAPI lwa (NOLOCK)
ORDER BY lwa.TimeStamp

vs

SELECT TOP 2000 lwa.Message INTO #foo
FROM dbo.LogWidgetsAPI lwa (NOLOCK)
ORDER BY lwa.TimeStamp
OPTION (MAXDOP 1)  ------------- Force it to run on a single CPU

1st example uses parallelism and takes about 30 seconds or so. 2nd one forces use of a single CPU and take 20 milliseconds.

Note: after running the single CPU query I go back to the running the multi-cpu one and the timing & plan is the same – so I don't think the issue is related to "cold cache" vs "warm cache"

So my theory is that because first query uses multiple CPUs, it must wait till all CPUs in question are idle, and therefore it just waits.

So my question. Should SQL Server VMs have dedicated CPUs or shared ones are normal?

Here is the plan which uses parallelism.
Here is the plan which forces a single CPU.

Best Answer

Is it normal to share CPUs on a VM of SQL Server?

Yeah, it's pretty common. A lot of times VMs are used to consolidate lots of SQL Servers (especially ones that don't have extreme performance requirements) onto one host. This can save on licensing costs, since SQL Server can be licensed at the host level.

Is it a good idea? I mean, it depends a lot on how oversubscribed the VM is, and how CPU intensive the workloads are.


Looking at the screenshots of the two execution plans, they are basically the same except for the parallelism. One problematic area in the parallel plan is the serial zone where the "Top" operator lives:

screenshot of the serial zone in the parallel plan

There is some overhead associated with bringing all the rows together onto one thread, and then redistributing them for the parallel insert. I wouldn't expect that overhead to be 30 seconds though.

So my theory is that because first query uses multiple CPUs, it must wait till all CPUs in question are idle, and therefore it just waits.

No, that's not how parallelism works in SQL Server. The threads that are scanning the clustered index at the top-right of the plan can do very uneven levels of work depending on how busy the different CPUs are.

Now, if this instance of SQL Server is so busy that all of the available threads are being used for other queries, then the parallel query might be waiting on THREADPOOL. Which brings me to my next point:

The parallel query is most likely waiting on some resource. I would start by looking at the "WaitStats" portion of the execution plan in SSMS:

screenshot of waitstats node in SSMS execution plan

That'll be in the "Properties" window of the left-most operator in your plan. As one example, a really high SOS_SCHEDULER_YIELD value in this case could be a sign of this SQL Server instance not getting a turn on the hosts CPU. Jonathan Kehayias has a really good post on that topic here:

CPU Ready Impact on SOS_SCHEDULER_YIELD

You could also compare the ratio of elapsed time to CPU time in the two queries. Those numbers are in the same properties window:

screenshot of time stats node in SSMS execution plan

If the ratio is significantly different between the two queries, that's another sign that the parallel query is waiting on some resource.

If you have access to the host / virtualization, you could look into stats directly there to see if guests are waiting long periods to be scheduled on the CPU. Jonathan has another post about that here, which is specific to VMWare: CPU Ready Time in VMware and How to Interpret its Real Meaning