SQL Server – Why is SAN Replication Causing Random Database Corruption?

availability-groups, replication, san, sql-server-2008-r2, sql-server-2012

We're having a strange issue with a disaster recovery technology we're trying to implement. The environment in both datacenters is the same: VMware hosts and Dell EqualLogic SANs running the same versions.

When we replicate from one datacenter to the other, random databases get corrupted and end up in SUSPECT mode. Each time we try this method, different databases are affected. Is something in SQL Server causing this, or is it the software the SAN uses to replicate?

I've been able to change the state of each affected database to EMERGENCY mode and run DBCC CHECKDB, but it's a different problem and a different database each time. Some of the errors I've found are index issues and data mismatch problems. I'm still checking other databases to see if I can find a pattern. If I find anything else, I'll post it here if it helps.
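For reference, this is roughly what I'm running against each affected database (MyDatabase is just a placeholder name):

    -- Roughly what I'm running per affected database; 'MyDatabase' is a placeholder.
    ALTER DATABASE [MyDatabase] SET EMERGENCY;
    GO
    -- Full consistency check; NO_INFOMSGS limits the output to the errors themselves.
    DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
    GO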

I've heard of people implementing this approach successfully, and it's the last task we need to figure out before we can close the project charter.

I was really hoping we could just use the built-in features of SQL Server, like mirroring or AlwaysOn Availability Groups.

The SQL Server versions are 2008 R2 and 2012, and we're in the process of installing some brand new SQL Server 2014 servers. Also, they're all Standard edition, not Enterprise.

Any input or things I could try would be of great help, thanks in advance!!

Edit#1 8/6/15 12:50 PM CST
The following are some of the error messages I've found in the Windows Event Viewer, which is more or less what DBCC CHECKDB reported (there's a query sketch after this list showing how I'm tracking the flagged pages).

  • EventID 605 – Attempt to fetch logical page (1:22620) in database 26 failed. It belongs to allocation unit 72057594239385600 not to 72057594249412608
  • EventID 824 – SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:1230; actual 0:0). It occurred during a read of page (1:1230) in database ID 58 at offset 0x0000000099c000 in file 'D:\Mydatabase.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB).
  • EventID 7886 – A read operation on a large object failed while sending data to the client. A common cause for this is if the application is running in READ UNCOMMITTED isolation level. This connection will be terminated.
  • EventID 608 – No catalog entry found for partition ID 72057594383564800 in database 23. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption.
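
In case it helps, I've also been checking msdb's suspect_pages table to see which databases and pages the engine has flagged; this is just a generic query, nothing specific to my environment:

    -- Pages flagged by the engine with 823/824-style errors, across all databases.
    -- event_type: 1 = 823/824 error, 2 = bad checksum, 3 = torn page,
    --             4 = restored, 5 = repaired, 7 = deallocated by DBCC.
    SELECT  DB_NAME(sp.database_id) AS database_name,
            sp.file_id,
            sp.page_id,
            sp.event_type,
            sp.error_count,
            sp.last_update_date
    FROM    msdb.dbo.suspect_pages AS sp
    ORDER BY sp.last_update_date DESC;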

Edit#2 8/6/15 2:24 PM CST – We've confirmed that restoring .bak backups of the databases in SUSPECT mode fixes the issue.
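For anyone else who hits this, the per-database fix is a straight restore over the suspect copy; the database name and backup path below are placeholders for our actual values:

    USE master;
    GO
    -- Restore the last known-good backup over the SUSPECT database.
    RESTORE DATABASE [MyDatabase]
    FROM DISK = N'\\backupserver\sql\MyDatabase.bak'
    WITH REPLACE, RECOVERY;
    GO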

Best Answer

In relation to your comment, I suspect an Ops-related issue here rather than a SQL Server engine issue. These SAN devices usually work at the block layer, and some handle transaction log/data file synchronization (among other things) better than others.

You can show the Ops team that no, SQL Server doesn't randomly corrupt data like this. You can restore backups to another server, set up mirroring, and do all of that with no corruption; the minute you do SAN-level replication, it happens. If SQL Server caused corruption like this, it wouldn't still be around. SQL Server has an enormous amount of code dedicated to detecting corruption, fixing corruption, and reducing the potential for corruption. You don't get this issue in any other environment and it only comes up with SAN replication, correct?
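One way to make that case concretely: run a clean DBCC CHECKDB across every user database on the source server right before a replication cycle, so you can show the corruption only appears on the replicated side. A rough sketch using a cursor over sys.databases (adjust the filters to suit your environment):

    -- Check every online user database on the source server; if these all come
    -- back clean and the SAN-replicated copies don't, the corruption is being
    -- introduced below SQL Server.
    DECLARE @db  sysname,
            @sql nvarchar(400);

    DECLARE db_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT name
        FROM   sys.databases
        WHERE  database_id > 4            -- skip system databases
          AND  state_desc = 'ONLINE';

    OPEN db_cur;
    FETCH NEXT FROM db_cur INTO @db;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@db) + N') WITH NO_INFOMSGS;';
        PRINT @sql;                        -- log which database is being checked
        EXEC sys.sp_executesql @sql;
        FETCH NEXT FROM db_cur INTO @db;
    END

    CLOSE db_cur;
    DEALLOCATE db_cur;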

Firmware is often a major cause of these types of issues. Get your Dell support rep on the line; they will have a lot more information and troubleshooting options. Don't settle for a lazy rep, because your enterprise's data and time are on the line. They have tools that run in the background to check what is causing this, and other tools such as DPAC that might help. This is not a SQL Server engine issue, so you will need the full support of Ops.

Edit: If your firmware is out of date or mismatched, get a policy from the Ops team that manages the SAN stating that they will keep the firmware up to date across the whole stack of machines they manage. If this SLA doesn't exist, make a note of it to your managers, because you're exposed to a lot of other issues besides this one.

I'm assuming you're using SAN block-level replication.

It could also be a mismatch in settings, such as different block sizes, though the SAN OS should usually be able to detect those issues.