I personally have used Raxco PerfectDisk (I am not associated with the company or any of its employees in any way) to do online defrags of SQL Server LUNs. It works perfectly, though it does slow the server down a bit, so I would recommend doing it during periods of lighter activity. When I say "works perfectly" I mean that it did not corrupt the volume or the SQL data files.
The built-in defragmenter does a very poor job if certain structures are fragmented on the drive. PerfectDisk shows you details about all the items that are fragmented, including the NTFS allocation tables, directories, alternate file streams, and so on.
Does PerfectDisk defragment SQL databases? http://support.raxco.com/KB/a106/does-perfectdisk-defragment-sql-databases.aspx
See this archived copy of a TechNet article regarding SQL Server and fragmentation: http://web.archive.org/web/20100803204458/http://www.microsoft.com/technet/abouttn/flash/tips/tips_083104.mspx
I didn't get a chance to rerun the trace and capture the more nicely formatted XDL file, but in the meantime I was able to resolve the issue by removing all the isolation level and locking hints from the trigger and modifying the application code to run the INSERT itself from within a serializable transaction.
The app uses LINQ to SQL, so the INSERT statement is dynamically generated and run under the default READ COMMITTED isolation level. From the deadlock trace, I could tell that the locks taken by the INSERT were conflicting with the locks taken by the trigger. The deadlock was happening even after removing all the isolation and locking hints from the trigger, so I surmised that the trigger's UPDATE statements were causing stronger locks than the INSERT statement. And when the competing simultaneous transaction tried to do the same thing, the deadlock occurred.
We were definitely on the right track with our previous attempted solutions: using higher isolation levels and stronger locking in the trigger. The problem is that the weaker locks were already taken by the INSERT statement generated by LINQ to SQL, before the trigger ever got called. The solution is to explicitly start a transaction in our .NET code using the SERIALIZABLE isolation level, do the insert in that transaction, let the trigger run in the same transaction (without altering the isolation level or lock hints), and then commit and dispose the transaction. When I did that, I was no longer able to reproduce the issue (I even put it back the old way and verified that the issue returned). Here is the new code:
' Do the insert in a serializable transaction to prevent simultaneous inserts from deadlocking
' due to the trigger, which reaches out to other rows.
Me.packageContext.Connection.Open()
Try
    Using oTransaction = Me.packageContext.Connection.BeginTransaction(IsolationLevel.Serializable)
        Me.packageContext.Transaction = oTransaction
        Try
            Me.packageContext.SubmitChanges()
            oTransaction.Commit()
        Catch ex As Exception
            oTransaction.Rollback()
            Throw
        End Try
    End Using
Finally
    ' We need to close the connection and null out the transaction to allow
    ' subsequent uses of the same DataContext to work properly outside this transaction.
    Me.packageContext.Connection.Close()
    Me.packageContext.Transaction = Nothing
End Try
Normally we avoid opening transactions from within client-side code, because they can run long if the code makes remote API calls, causing excessive blocking. But in this case we are isolating a very small bit of code, and we have robust error/rollback handling around it.
An alternate solution would be to implement stored procedures that do the INSERTs within explicit SERIALIZABLE transactions, and then alter the DataContext to call the stored procedure. However, that solution is less maintainable as it requires updating both the stored procedure and the DataContext definition every time the schema changes.
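To illustrate, here is a rough sketch of what such a stored procedure might look like; the procedure, table, and column names are hypothetical, and THROW assumes SQL Server 2012 or later:

-- Hypothetical example: wrap the INSERT in an explicit SERIALIZABLE transaction
-- so that the trigger also runs under the stronger isolation level.
CREATE PROCEDURE dbo.InsertPackage
    @Name nvarchar(100),
    @Status int
AS
BEGIN
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    BEGIN TRY
        BEGIN TRANSACTION;

        INSERT INTO dbo.Package (Name, Status)
        VALUES (@Name, @Status);
        -- The trigger on dbo.Package fires here, inside the same serializable transaction.

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END;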
Finally, another potential alternate solution would be to examine the deadlock log in detail and find a way to adjust our indexes or keys to avoid the deadlock in the first place. However, that is more difficult and may not be possible with our design.
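For completeness, if we did want to dig into the deadlock details further, one common way to pull recent deadlock graphs without setting up a new trace is to query the built-in system_health Extended Events session (available in SQL Server 2008 and later), along these lines:

-- Pull xml_deadlock_report events from the system_health ring buffer.
SELECT
    xed.value('@timestamp', 'datetime2') AS deadlock_time,
    xed.query('.') AS deadlock_graph
FROM (
    SELECT CAST(st.target_data AS xml) AS target_data
    FROM sys.dm_xe_session_targets AS st
    JOIN sys.dm_xe_sessions AS s
        ON s.address = st.event_session_address
    WHERE s.name = N'system_health'
      AND st.target_name = N'ring_buffer'
) AS src
CROSS APPLY src.target_data.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]') AS x(xed);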
I think the solution above is the simplest and most understandable for future developers.
Best Answer
SQL Server uses write-ahead logging (WAL), so any committed transaction will have had its log records flushed to disk before the commit completed (unless you are using delayed durability). This means that when your database comes back online, crash recovery will replay committed transactions from the log file and roll back any transactions that were still in progress.
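(As a quick sanity check, on SQL Server 2014 and later you can confirm whether delayed durability is enabled for any database:)

-- Delayed durability setting per database (SQL Server 2014+).
SELECT name, delayed_durability_desc
FROM sys.databases;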
Despite the WAL architecture, I'm sure Windows didn't appreciate the abrupt disk loss. It certainly seems possible that data could have been corrupted - for instance, if a write to the physical log file (or to an 8 KB data page) was only halfway complete at the instant the disk was removed, that file could well be left corrupted.
If you're using the full recovery model, you could likely still do a point-in-time recovery up to just before the corruption, though there would be some data loss due to the partial write.
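For reference, a rough sketch of what that point-in-time restore could look like (all database names, file paths, and the STOPAT time are placeholders):

-- 1. If the log file is still readable, take a tail-of-the-log backup first.
BACKUP LOG MyDatabase
    TO DISK = N'C:\Backups\MyDatabase_tail.trn'
    WITH NO_TRUNCATE, NORECOVERY;

-- 2. Restore the most recent full backup without recovering.
RESTORE DATABASE MyDatabase
    FROM DISK = N'C:\Backups\MyDatabase_full.bak'
    WITH NORECOVERY, REPLACE;

-- 3. Apply log backups, stopping just before the incident.
RESTORE LOG MyDatabase
    FROM DISK = N'C:\Backups\MyDatabase_tail.trn'
    WITH STOPAT = N'2013-06-01T10:45:00', RECOVERY;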
There's really not much to be done other than running DBCC CHECKDB. This will check the database to see whether it was corrupted during this unusual event. Another option would be to restore a backup (from before this event) to a separate database or server, and then use a comparison tool (like SQL Data Compare from Red Gate) to see if any data differs between the backup and the current database.
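As a minimal example (the database name is a placeholder):

-- Check the whole database for corruption; the options keep the output
-- limited to actual errors.
DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;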