Sql-server – SQL 2008 R2 performance problems – mixed signals on disk performance

sql server, sql-server-2008-r2, vmware

I'm having no end of trouble with one of our production SQL boxes – users are experiencing massive drop-offs in performance, and this is having a severe impact on the business. I've been investigating mainly from a SQL/DBA perspective, and sp_BlitzFirst has been intermittently reporting very slow read/write times (typically ~100-300ms, but at points 1000ms+).

The real head scratcher is that after getting our infrastructure guys involved they can't see anything amiss from their end.

The server is a virtual machine running under VMware, with storage provided by a Dell EqualLogic SAN. It runs Windows Server 2008 R2 and SQL 2008 R2 Standard with 12 cores and 32 GB of RAM, and hosts the databases for a Dynamics CRM 4.0 instance and another web application that is integrated with CRM. On the VMware side, the infrastructure guys are seeing latencies more in the 30-40ms range. It's entirely possible that the samples taken by BlitzFirst and VMware simply aren't coinciding, which would explain the disparity, but if not, could there be any other explanation? I've used perfmon on the server, and while the averages for disk read/write are pretty reasonable, I see definite spikes that are much higher than VMware is reporting – much closer to the figures from BlitzFirst. Is there anything else I should be looking at?
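To illustrate how two tools can disagree purely because of their sampling interval, here is a minimal Python sketch. The latency figures, the per-second granularity and the 5-second and 60-second windows are all invented for illustration; they are not measurements from this server, nor the actual intervals used by BlitzFirst, perfmon or VMware.

```python
from statistics import mean


def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical per-second disk latencies in ms: a steady 20 ms baseline
# with a single 3-second stall at 1000 ms. Invented numbers, not real data.
latencies_ms = [20] * 60
latencies_ms[30:33] = [1000, 1000, 1000]

fine = window_averages(latencies_ms, 5)     # short window: the stall stands out
coarse = window_averages(latencies_ms, 60)  # long window: the stall is diluted

print("worst 5-second average :", max(fine))
print("worst 60-second average:", max(coarse))
```

With these made-up numbers the worst 5-second average is 608 ms, while the 60-second average is only 69 ms, so a short stall that users would definitely feel can all but vanish in a longer averaging window. Real tools also weight latency by the number of I/Os completed, which this sketch ignores.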

I'm reasonably certain that the problems BlitzFirst is reporting are real as they always occur at the same points the users experience major problems. To be honest the application performance of this CRM has always been pretty crap but even with a pretty low bar it's still falling well short.

Best Answer

Not wanting to be one of those OPs who just disappears without providing closure, here's an update.

Despite multiple assurances from infra that the vhost was in no way overloaded and that contention for physical resources couldn't possibly be the problem, they eventually moved several VMs off the host to another one, and the symptoms disappeared once that was complete. So either the problems were caused by resource contention on the host bottlenecking the VM's performance, or symptoms that merely looked like resource contention just so happened to disappear at the same time that load on the vhost was reduced.

So many thanks to everyone for the helpful comments/pointers!