Slow SQL Server 2008 R2 Backups After Windows Update

performance · sql-server-2008-r2

I recently installed a series of Windows Updates on our SQL Server 2008 R2 server. Note that these were not SQL Server updates; they were just the usual monthly Windows updates. They included a couple of reboots, but otherwise this was routine maintenance. Since then, however, SQL Server has been running at least four times slower than usual, for both backups and queries, and I'm at a loss to figure out why. Backups that used to take 3 hours now take 20. I appreciate there are myriad reasons why queries can be slow after a restart, but backups?

I've heard there were issues with the Balanced Power Plan when upgrading from SP2 to SP3 with SQL Server 2008 R2, but this server has had SP3 installed for over a year now. I tried changing to the High Performance Plan anyway, but that made no difference to the speed of the backups.

I also checked whether there were any pending sp_configure changes that would kick in (or kick out) after a restart, but the Error Logs did not show any changes.
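
(For anyone running the same check: a pending change also shows up as a mismatch between value and value_in_use in sys.configurations; the query below is just a rough sketch of that idea.)

    -- Any row returned here is an option that has been changed but whose new
    -- value has not yet taken effect (it would apply after RECONFIGURE or,
    -- for non-dynamic options, after a service restart).
    SELECT  name,
            CAST(value        AS bigint) AS configured_value,
            CAST(value_in_use AS bigint) AS running_value
    FROM    sys.configurations
    WHERE   CAST(value AS bigint) <> CAST(value_in_use AS bigint);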

Is it possible that the network adapters would somehow reset after a reboot? One of the Windows updates was for the "Intel Ethernet Server Adapter X520-2". Could that be an issue? Would the Jumbo Frames size, or any other configuration value, reset itself after a driver update?

To give more information about the slow backups: the biggest database I back up is about 3 TB, hosted on a single Windows drive letter. That drive is made up of six 820 GB spindle drives (strictly speaking, twelve drives in a RAID 1 configuration). Previously, this database would back up in about 3.5 hours at a rate of 192 MB/sec. Now it takes 20 hours at a rate of 32 MB/sec, and I don't understand what has changed. It seems that in the past, data must have been pulled from all six drives simultaneously during the backup to achieve 192 MB/sec, but judging by PerfMon, the backup now appears to be pulling data from just one drive at a time, hence 32 MB/sec. Why? Could this have been caused by a change in the RAID controller, the network adapters, the SQL Server backup command itself, or something else? And is it typical for data to be read from and written to an array of drives in parallel? If so, what piece of hardware controls that?

Best Answer

This is a bit of a fishing expedition, but here is the troubleshooting process I would follow, assuming you don't have your own stats to help clear things up.

  1. Check the history of database size and compressed backup size in your msdb catalogs. Have they changed over the past few weeks? (There's a sample query after this list.)

    • Sometimes an application will go berserk and fill a database with garbage; what was 100 GB is now 1 TB. SCOM and SharePoint are good examples of this.
    • Sometimes someone can go in and turn off compression at the instance level or in whatever backup software you use, especially if they are trying to force you to use de-duplication on a SAN.
  2. You need to talk to the staff who manage the infrastructure and the SAN.

    • If this is a VM, they may have loaded the host with another guest that is chewing up resources.
    • Their VM monitoring software may be logging useful statistics that will help you narrow things down.
    • The SAN people usually have their own built-in monitors that may be able to add some detail.
  3. If you're still stuck with no metrics of any kind, go back to your msdb information and use the timings to calculate an MB/sec figure for your backups. Now take a copy-only backup to NUL (note: this will itself put load on your system, of course) and compare; see the sketch after this list.

    • If your backup-to-NUL speed is below your previous MB/sec figure, you'll know that the read speed of the database has changed and is likely the bottleneck (there may be others, but it's a starting point).
    • Otherwise you'll know the server itself is fine and that you need to refocus your attention on network IO and the destination SAN.
  4. If all of this fails for some reason...

    • You can see which updates were installed in Add/Remove Programs and do some Googling on the KBs to see what turns up. I don't think you'll find anything, but who knows?
    • Start capturing new metrics. You could use the built-in Management Data Warehouse functionality, seeing as you have nothing else, or a trial of a paid tool.
    • If you still had to do it manually, you could capture the disk IO DMVs (sys.dm_io_virtual_file_stats) using sample scripts from the net to work out your IO wait times, and see if they're abnormally high (say, over 100 ms), which would point at a local disk (or SAN) issue or at intensive queries; a minimal example follows this list. Then you can use other DMVs (and sample scripts) to pull intensive queries out of the plan cache and see if you can spot anything strange.
    • Have any new staff been added, whether a new developer, a new member of an existing AD group that has been granted access, or a new login on the server (check sys.server_principals; a one-liner is shown after the list)? I've seen new people point Excel and other tools at a production database and kill it without realising the impact.
    • It's possible, but unlikely, that you're in a situation where the first few queries that came in had parameters that created poorly optimised plans, which are now doing table scans over and over and dragging down performance. If you got desperate, you could also free the plan cache (seeing as the system is in dire straits anyway) and see what happens; the command is at the end of the examples below. I've never seen this drag down a whole system, but I have seen it kill individual applications.
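
To illustrate point 1, a rough query against the msdb backup catalog might look like the following; the database name is a placeholder, and the MB/sec column is only an approximation derived from backup_size and the start/finish times.

    -- History of full-backup sizes and approximate throughput for one database.
    SELECT  bs.database_name,
            bs.backup_start_date,
            bs.backup_size            / 1048576.0 AS backup_mb,
            bs.compressed_backup_size / 1048576.0 AS compressed_mb,
            DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) AS duration_s,
            (bs.backup_size / 1048576.0)
                / NULLIF(DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date), 0)
                AS approx_mb_per_sec
    FROM    msdb.dbo.backupset AS bs
    WHERE   bs.database_name = N'YourBigDatabase'   -- placeholder
      AND   bs.type = 'D'                           -- full backups only
    ORDER BY bs.backup_start_date DESC;

A sudden jump in backup_mb, or compressed_mb creeping up towards backup_mb, would point at the growth or lost-compression scenarios above.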
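
For the test in point 3, the copy-only backup to NUL is a one-off command along these lines (again with a placeholder database name); the completion message reports its own MB/sec figure, which you can compare against the historical numbers from msdb.

    -- Reads the whole database but discards the output, which isolates the
    -- read side from the network and the backup destination.
    -- Warning: this still puts a full backup's worth of read load on the server.
    BACKUP DATABASE [YourBigDatabase]        -- placeholder
    TO DISK = 'NUL'
    WITH COPY_ONLY, COMPRESSION, STATS = 5;  -- STATS = 5 prints progress every 5%

If you don't normally back up with compression, drop that option so the test matches your regular backups.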
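
The IO latency check in point 4 boils down to something like the query below; it's a minimal sketch, and the counters are cumulative since the last restart, so treat the results as long-run averages rather than a live view.

    -- Average IO stall per read and per write for every database file.
    -- Sustained read stalls of 100 ms or more point at the disk subsystem,
    -- or at a workload that is hammering it.
    SELECT  DB_NAME(vfs.database_id) AS database_name,
            mf.physical_name,
            vfs.num_of_reads,
            vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
            vfs.num_of_writes,
            vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
              ON  mf.database_id = vfs.database_id
              AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_stall_ms DESC;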
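
Checking for recently added logins is a one-liner against sys.server_principals:

    -- Most recently created logins and roles; anything unexpected near the
    -- top of the list is worth asking about.
    SELECT  name, type_desc, create_date, modify_date
    FROM    sys.server_principals
    ORDER BY create_date DESC;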
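
And the last-resort plan cache flush from the final bullet is simply:

    -- Clears every cached plan on the instance; expect a brief CPU spike while
    -- plans recompile, so only do this if you're already desperate.
    DBCC FREEPROCCACHE;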

As you can see, if you have no previously recorded information about how the system should be behaving, it's a matter of thinking of everything it could possibly be and working out how you're going to rule each possibility in or out through expensive, time-wasting investigation. But them's the breaks.
