SQL Server hangs during backup on EC2

amazon-ec2, aws, backup, sql-server, sql-server-2012

Every once in a while our SQL Server will "hang" during its weekly full backup. When this happens, external connections start to time out. The symptoms are:

  • Connections timeout
  • The sqlservr.exe process sits at its usual memory level (47 GB, with 12 GB left for the OS) but only ~0-5% CPU (normal is ~30%), with the occasional spike to ~25% for a few seconds before dropping back to 0-5%.
  • I can connect to the instance using SSMS, but no queries respond: I can't get a list of databases or access DMVs. The queries don't time out, but they never return results either. Same with a DAC connection (the diagnostic sketch after this list shows what we try to run).
  • Nothing else running on the server, it's a dedicated SQL box.
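
When one of these hangs is in progress, this is roughly what we try to run from the SSMS and DAC sessions. A minimal sketch, assuming a session responds at all; the DMVs are standard and nothing in it is specific to our setup:

    -- From the DAC connection (admin:<servername>), if it responds at all.
    -- Lists what every active request is waiting on; during a hang like this
    -- we would expect long BACKUPIO / PAGEIOLATCH-style waits to pile up.
    SELECT r.session_id,
           r.command,
           r.status,
           r.wait_type,
           r.wait_time,
           r.percent_complete
    FROM sys.dm_exec_requests AS r
    WHERE r.session_id > 50;

    -- The same idea at the task level, including any blocking chain.
    SELECT wt.session_id,
           wt.wait_type,
           wt.wait_duration_ms,
           wt.blocking_session_id,
           wt.resource_description
    FROM sys.dm_os_waiting_tasks AS wt
    WHERE wt.session_id IS NOT NULL
    ORDER BY wt.wait_duration_ms DESC;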

The server is an i2.2xlarge EC2 instance running SQL Server 2012 Standard Edition on Windows Server 2012 R2.

Our backup schedule consists of daily differentials and weekly fulls, and this issue has always occurred during the full backup. By simply going through the backup files on disk I can see that it completed a couple of smaller databases (<200 MB) and then never completed one of the bigger DBs (~100 GB, typical compressed size ~10 GB).
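
For context, the jobs themselves are nothing exotic; a minimal sketch of what the weekly full and daily differential amount to (database name and paths are placeholders, not our actual jobs):

    -- Weekly full; this is the step that hangs on the ~100 GB database.
    BACKUP DATABASE BigDb
    TO DISK = N'F:\Backups\BigDb_full.bak'
    WITH COMPRESSION, CHECKSUM, INIT, STATS = 5;

    -- Daily differential against the most recent full.
    BACKUP DATABASE BigDb
    TO DISK = N'F:\Backups\BigDb_diff.bak'
    WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, INIT, STATS = 5;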

While this is going on, I can remote into the machine just fine; the network is good and all disks seem to perform nominally.

The timeline of what happens looks like this:

  • 0300AM: DBCC CHECKDB is run on all databases, all OK.
  • 0345AM: BACKUP is started on the ~100GB database.
  • 0347AM: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [XXX] in database id 7. The OS file handle is 0x00000000000010D4. The offset of the latest long I/O is: 0x000002dbbd0000
  • Repeat ~20 times
  • 0351AM: Time out occurred while waiting for buffer latch — type 2, bp 0000000BE55A9CC0, page 1:1864895, stat 0x40d, database id: 102, allocation unit id: 7493989815628529664, task 0x0000000B91125088 : 0, waittime 300 seconds, flags 0x1a, owning task 0x0000000B91125088. Continuing to wait.
  • Repeat ~5 hours & 500k times
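
While the buffer latch messages repeat, the one thing worth capturing (if any session answers) is whether SQL Server itself still has I/O outstanding at the OS level. A sketch of the standard pending-I/O query; again, nothing here assumes our particular configuration:

    -- I/O requests SQL Server has issued that Windows has not completed yet.
    -- io_pending = 1 means the OS still holds the request;
    -- io_pending_ms_ticks shows how long it has been outstanding.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           vfs.file_id,
           pio.io_type,
           pio.io_pending,
           pio.io_pending_ms_ticks
    FROM sys.dm_io_pending_io_requests AS pio
    JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        ON pio.io_handle = vfs.file_handle
    ORDER BY pio.io_pending_ms_ticks DESC;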

The buffer latch timeouts are not just on the database being backed up, but on all databases. Usually this backup procedure finishes in 1-2 hours.

After about 5 hours we give up, as all external connections are failing. Attempting to stop SQL Server from the services control panel simply hangs in the "stopping" state. Attempting to kill the process results in an "Access denied" message – we normally have full rights to kill the process (tested afterwards); the Access Denied message only appears during these events. Eventually we have to force a hard reboot of the machine because we can't get rid of the stuck sqlservr.exe process. Once the machine comes back up, everything runs flawlessly and all CHECKDBs come back clean. We then initiate a new full backup and it completes as expected.

Looking at the Windows event logs, around 0347AM we see "xenvbd" entries, though with no description. We also see "disk" entries: "The IO operation at logical block address 0x14a4e070 for Disk 2 (PDO name: \Device\00000033) was retried." These events continue for the duration – ~5 hours.

Disk 2 is our data drive. Disk 3 is our backup drive, but we see no mention of it in the log. Both are SSD EBS volumes.
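
To tie the "disk" events back to specific database files, a query along these lines (illustrative only, standard DMVs) shows cumulative I/O stall time per file along with its physical path, which tells us whether the stalls sit on the data volume, the backup volume, or both:

    -- Cumulative I/O stall time per database file since the last restart.
    -- physical_name shows which volume (data vs. backup drive) each file is on.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           vfs.io_stall_read_ms,
           vfs.io_stall_write_ms,
           vfs.io_stall
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id
       AND mf.file_id = vfs.file_id
    ORDER BY vfs.io_stall DESC;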

So far my theory is that the EBS volumes get clogged up, causing I/O to back up severely. Even so, I don't understand why that would cause SQL Server to get stuck on itself and stop responding to everything. When I remote into the server while this is going on, I can access the disks just fine. Could a huge backlog of I/O requests be causing this? Am I missing some crucial EC2 tuning settings? Any tips on what to do, besides moving off EC2? I'm pretty sure the root cause is EBS, but seeing as this works flawlessly 99.99% of the time, I'd really like to find a way to avoid the issue.

Best Answer

We are having a similar problem when we conduct backups. It looks as though the disk reads are maxing out the EBS connection.

I think the problem comes down to the throughput limits of the instance type. Check the max bandwidth column here:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

The only solution seems to be switching to a different instance type with more throughput. Have you found any other solutions?
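
If switching instance types isn't feasible, another thing worth testing (not covered in this answer) is limiting how much I/O the backup keeps in flight: BACKUP exposes BUFFERCOUNT and MAXTRANSFERSIZE, which together roughly bound its outstanding I/O (BUFFERCOUNT × MAXTRANSFERSIZE). A hedged sketch with illustrative values that would need tuning against the instance's EBS bandwidth cap:

    -- Lower BUFFERCOUNT / MAXTRANSFERSIZE keep the backup from saturating
    -- the EBS link, at the cost of a longer backup window.
    BACKUP DATABASE BigDb                      -- placeholder name
    TO DISK = N'F:\Backups\BigDb_full.bak'     -- placeholder path
    WITH COMPRESSION, CHECKSUM, STATS = 5,
         BUFFERCOUNT = 8,                      -- default is calculated per backup device
         MAXTRANSFERSIZE = 262144;             -- 256 KB instead of the default 1 MB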