Sql-server – SQL Server logical consistency-based I/O error

sql serversql-server-2008

I have a job on my live environment DB and the job has failed again this morning reporting SQL Server Error 824 (Soft I/O error, where the OS could read the page but SQL has treated it as corrupt):

SQL Server detected a logical consistency-based I/O error: incorrect checksum it occurred during a read of page (1:9545681) in database ID 5 at offset 0x0000123615a000 in file 'D:XXX\XX.mdf'

I have had a look and ran DBCC CHECKDB with all_errormsgs and saw the error didn't mention CHECKSUM failure, however it reports the repair level needed to repair this error is repair_allow_data_loss.

Is there anything else that can be done other than running the repair SQL Server has recommended since I don't want any data loss?

I have a good backup policy in place, however the backup that was last working when the job ran without any problems has now been deleted due to space issues. Is there anything else I can do now?

Best Answer

First; ensure you have a reliable, recent backup of the database in question. Test the backup by restoring it to some other server, then run DBCC CHECKDB against it to determine if the backup contains the corrupted page(s). If the backup shows no corruption, and DBCC CHECKDB returns a clean bill-of-health for the database, you might consider just restoring the backup if it's recent enough to meet your recovery point objective.

Second, backup the database (while ensuring you don't overwrite any existing backups you may need), and restore it to some other machine. Run DBCC TRACEON (3604); DBCC PAGE (5, 1, 9545681) to see what kind of data is stored in the page - look for m_objId and m_indexId in the output, then check that against sys.objects and sys.indexes. Best case scenario, the data will be part of a non-clustered index that you can simply drop and recreate with ALTER INDEX ... REBUILD. If the data on the page belongs to a clustered index, or a heap, you run the risk that DBCC CHECKDB (repair_allow_data_loss) will cause you to lose access to the entire table, or at least part of the data stored in it.

If you manage to get the corruption issue resolved, you should determine the root cause so you can mitigate the issue, and prevent future corruption. If hardware is an issue, replace it.

Brent Ozar has a great article detailing how to approach this problem on his blog.