SQL Server – Why is SAN Replication Causing Random Database Corruption?

availability-groups, replication, san, sql-server-2008-r2, sql-server-2012

We're having a strange issue with a disaster recovery technology we're trying to implement. The environment in both datacenters is the same: VMware hosts and Dell EqualLogic SANs running the same versions.

When we replicate from one datacenter to the other, random databases get corrupted and end up in SUSPECT mode. Each time we try this method, different databases are affected. Is something in SQL Server causing this, or is it the software the SAN uses to replicate?

I've been able to change the state of each affected database to EMERGENCY mode and run DBCC CHECKDB, but it's a different problem and a different database each time. Some of the errors I've found are index issues and data mismatch problems. I'm still checking other databases to see if I can find a pattern. If I find anything else, I'll post it here if it helps.
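For reference, this is roughly what I'm running against each affected database (MyDatabase is just a placeholder name):

    -- Roughly what I'm running per affected database; 'MyDatabase' is a placeholder.
    ALTER DATABASE [MyDatabase] SET EMERGENCY;
    GO
    -- Full consistency check; NO_INFOMSGS limits the output to the errors themselves.
    DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
    GO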

I've heard of people implementing this approach successfully, and it's the last task we need to figure out before we can close the project charter.

I was really hoping we could just use the built-in features of SQL Server, like mirroring or AlwaysOn Availability Groups.

The SQL Server versions are 2008 R2 and 2012, and we're in the process of installing some brand new SQL Server 2014 servers. Also, they're all Standard edition, not Enterprise.

Any input or things I could try would be of great help, thanks in advance!!

Edit#1 8/6/15 12:50 PM CST
The following are some of the error messages I've found in the Windows Event Viewer, which is more or less what DBCC CHECKDB reported (there's a query sketch after this list showing how I'm tracking the flagged pages).

  • EventID 605 – Attempt to fetch logical page (1:22620) in database 26 failed. It belongs to allocation unit 72057594239385600 not to 72057594249412608
  • EventID 824 – SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:1230; actual 0:0). It occurred during a read of page (1:1230) in database ID 58 at offset 0x0000000099c000 in file 'D:\Mydatabase.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB).
  • EventID 7886 – A read operation on a large object failed while sending data to the client. A common cause for this is if the application is running in READ UNCOMMITTED isolation level. This connection will be terminated.
  • EventID 608 – No catalog entry found for partition ID 72057594383564800 in database 23. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption.
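
In case it helps, I've also been checking msdb's suspect_pages table to see which databases and pages the engine has flagged; this is just a generic query, nothing specific to my environment:

    -- Pages flagged by the engine with 823/824-style errors, across all databases.
    -- event_type: 1 = 823/824 error, 2 = bad checksum, 3 = torn page,
    --             4 = restored, 5 = repaired, 7 = deallocated by DBCC.
    SELECT  DB_NAME(sp.database_id) AS database_name,
            sp.file_id,
            sp.page_id,
            sp.event_type,
            sp.error_count,
            sp.last_update_date
    FROM    msdb.dbo.suspect_pages AS sp
    ORDER BY sp.last_update_date DESC;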

Edit#2 8/6/15 2:24 PM CST – We've confirmed that restoring .bak backups of the databases in SUSPECT mode fixes the issue.
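For anyone else who hits this, the per-database fix is a straight restore over the suspect copy; the database name and backup path below are placeholders for our actual values:

    USE master;
    GO
    -- Restore the last known-good backup over the SUSPECT database.
    RESTORE DATABASE [MyDatabase]
    FROM DISK = N'\\backupserver\sql\MyDatabase.bak'
    WITH REPLACE, RECOVERY;
    GO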

Best Answer

In relation to your comment, I suspect an Ops-related issue here rather than a SQL Server engine issue. These SAN devices usually work at the block layer, and some handle transaction log/data file synchronization (among other things) better than others.

You can show the Ops team that no, SQL Server doesn't randomly corrupt data like this. You can restore backups to another server, set up mirroring, and do all of that with no corruption; the minute you do SAN-level replication, it happens. If SQL Server caused corruption like this, it wouldn't still be around. SQL Server has an enormous amount of code dedicated to detecting corruption, fixing corruption, and reducing the potential for corruption. You don't get this issue in any other environment and it only comes up with SAN replication, correct?
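One way to make that case concretely: run a clean DBCC CHECKDB across every user database on the source server right before a replication cycle, so you can show the corruption only appears on the replicated side. A rough sketch using a cursor over sys.databases (adjust the filters to suit your environment):

    -- Check every online user database on the source server; if these all come
    -- back clean and the SAN-replicated copies don't, the corruption is being
    -- introduced below SQL Server.
    DECLARE @db  sysname,
            @sql nvarchar(400);

    DECLARE db_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT name
        FROM   sys.databases
        WHERE  database_id > 4            -- skip system databases
          AND  state_desc = 'ONLINE';

    OPEN db_cur;
    FETCH NEXT FROM db_cur INTO @db;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@db) + N') WITH NO_INFOMSGS;';
        PRINT @sql;                        -- log which database is being checked
        EXEC sys.sp_executesql @sql;
        FETCH NEXT FROM db_cur INTO @db;
    END

    CLOSE db_cur;
    DEALLOCATE db_cur;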

Firmware is often a major cause of these types of issues. Get your Dell support rep on the line; they will have a lot more information and troubleshooting options. Don't settle for a lazy rep, because your enterprise's data and time are on the line. They have tools that run in the background to check what is causing this, and other tools such as DPAC that might help. This is not a SQL Server engine issue, so you will need the full support of Ops.

Edit: If your firmware is out of date or mismatched, get a policy from the Ops team that manages the SAN stating that they will keep the firmware up to date across the whole stack of machines they manage. If this SLA doesn't exist, make a note of it to your managers, because you're exposed to a lot of other issues besides this one.

I'm assuming you're using SAN block-level replication.

It could also be a mismatch in settings, such as different block sizes, though the SAN OS should usually be able to detect those issues.