Sql-server – Backup detects corruption, but CHECKDB does not

corruptiondbcc-checkdbsql server

I have a database where when I run the backup command

BACKUP DATABASE [MyDatabase] TO     
DISK =  'G:\Backup\MyDatabase_01_01_2018.bak'   
WITH    NOFORMAT, NOSKIP, COMPRESSION, INIT, BUFFERCOUNT = 100

I get the error message

Msg 3043, Level 16, State 1, Line 8
BACKUP 'MyDatabase' detected an error on page (1:745345) in file 'F:\Data\MyDatabase_1.ndf'.
Msg 3013, Level 16, State 1, Line 8
BACKUP DATABASE is terminating abnormally.

I ran a full CHECKDB but it comes back clean. I did notice that the Page Verify option had been set to NONE (not my doing) so I changed it to CHECKSUM and rebuilt all the indexes in the DB to get it to write to all pages and generate checksums. After this the backup still fails and the checkdb still shows clean (so no change).

DBCC CHECKDB('MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS,
DATA_PURITY, EXTENDED_LOGICAL_CHECKS;

from the SQL log:

DBCC CHECKDB (MyDatabase) WITH all_errormsgs, no_infomsgs, data_purity executed by xxx found 0 errors and repaired 0 errors. Elapsed time: 0 hours 21 minutes 46 seconds. Internal database snapshot has split point LSN = 000ab776:0000112f:0001 and first LSN = 000ab776:0000112d:0001.

I ran DBCC PAGE but it errors (doesn't even seem to be returning the right page in the first place). I CAN run it with print option 2 and it returns but honestly I don't know what I'm looking for there.

DBCC PAGE ('MyDatabase',1,745345,3)

PAGE: (3:513793)

BUFFER:


BUF @0x00000003811F8280

bpage = 0x00000000F2D70000          bhash = 0x0000000000000000          bpageno = (1:745345)
bdbid = 5                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 44283                       bstat = 0x809
blog = 0x5adb215a                   bnext = 0x0000000000000000          

PAGE HEADER:


Page @0x00000000F2D70000

m_pageId = (3:513793)               m_headerVersion = 1                 m_type = 2
m_typeFlagBits = 0x4                m_level = 0                         m_flagBits = 0x0
m_objId (AllocUnitId.idObj) = 1075937538                                 m_indexId (AllocUnitId.idInd) = 2
Metadata: AllocUnitId = 633462595911680                                  Metadata: PartitionId = 0
Metadata: IndexId = -1              Metadata: ObjectId = 0              m_prevPage = (3:513795)
m_nextPage = (3:513820)             pminlen = 17                        m_slotCnt = 426
m_freeCnt = 2                       m_freeData = 7338                   m_reservedCnt = 0
m_lsn = (608841:643611:411)         m_xactReserved = 0                  m_xdesId = (0:0)
m_ghostRecCnt = 0                   m_tornBits = 0                      DB Frag ID = 1

Allocation Status

GAM (1:511232) = ALLOCATED          SGAM (1:511233) = NOT ALLOCATED     
PFS (1:744096) = 0x40 ALLOCATED   0_PCT_FULL                             DIFF (1:511238) = NOT CHANGED
ML (1:511239) = NOT MIN_LOGGED

Msg 2514, Level 16, State 8, Line 20
A DBCC PAGE error has occurred: Invalid page metadata – dump style 3 not possible.

Any ideas what I could try next? Server version is

select @@version

Microsoft SQL Server 2014 (SP2-CU11) (KB4077063) - 12.0.5579.0 (X64) 
    Feb 21 2018 12:19:47 
    Copyright (c) Microsoft Corporation
    Developer Edition (64-bit) on Windows NT 6.3  (Build 9600: ) (Hypervisor)

Compatibility level of the DB is 100 (SQL 2008).

Best Answer

This answer is taken from an issue of the SQLskills.com newsletter written by Paul Randal, about "a database which would fail a backup with page checksum errors, but passed a DBCC CHECKDB".

The only time this can happen is when an extent is a mixed extent (where the 8 pages in the extent can be allocated to potentially 8 different allocations units – see here) and some pages are erroneously marked as allocated by the relevant PFS page.

When that happens, DBCC CHECKDB will not attempt to read those pages, as it derives what pages to read from an allocation unit's IAM pages (the first of which lists the pages allocated from a mixed extent). This case is a gap in DBCC CHECKDB's corruption-detection logic.

[Because] DBCC CHECKDB couldn't detect the corruption, it wasn't possible for it to make the repairs needed to fix them. So using DBCC WRITEPAGE, I worked out the changes needed in the allocation status for the erroneously-allocated pages, directly in the PFS page, and it worked!

This was an extremely rare case – it's much more common that a DBCC CHECKDB fails but a backup would succeed.

In my opinion, Paul's resolution is way above and beyond exporting and importing the data like you did, so I think you did the right thing.

Related Solutions

Sql-server – Corrupt DB for a noob

It looks as though you have a problem with your storage subsystem (somewhere from the drivers down to the actual disks, but it could be anywhere in that stack).

The good news:

Most of the corruption is involved in non-clustered indexes. This means that if the underlying table is clean, the indexes can be rebuilt to fix the corruption. Object IDs 1345738380, 1761739862 (this has clustered index corruption as well), 2056692908
It doesn't seem like any PFS/GAM/SGAM/etc system pages were damaged.

The bad news:

There are some pages corrupt in the clustered index. This means there could (most likely will) be data loss without a recent backup. Object ID 1761739862
You don't have a recent backup.
You already ran checkdb with repair_allow_data_loss so there is no way to know what you've already lost or what has already been done which takes you an extra step back.
It looks like there is an issue with the disk subsystem.

Where to go from here:

I'd start by taking a backup with continue_after_error to make sure that you have something of a record. Then restore that backup to an instance on the same patch level as the one it was taken from so that all testing can be completed on a copy of the database and not the database itself.

Use the copy and the older backup of the database you have on a second instance to see what data is lost and what may be able to be manually salvageable. This is time consuming and lengthy, but may be needed. If you're really stuck, call in a consultant to help you with this as corruption is a great thing to cut your teeth on and gives you some semblance of liability.

Check the objects associated with the corruption. If they are tables that don't have any super important data (say dictionary tables that can be rebuilt or don't change much) you might be able to get away with manually fixing the table by using the older backup to script out data.

Check your storage subsystem, make sure you have the latest drivers, firmware, etc. Check for any failed drives in the array/san/nas/etc. Double check Ethernet cables, fibre cables, switches, etc. Find the root cause of the corruption or this may happen again. Run a health check on your storage subsystem/motherboard to make sure something isn't faulty with the hardware or controllers.

Lastly, update your resume.

Sql-server – CHECKDB found 2 allocation errors and 0 consistency errors not associated with any single object

My first question would be do you have latest valid backup or not. Since database size is just 6G and also because minimum repair suggested by checkdb is repair_allow_data_loss , which can cause data loss and could also remove business constraints from database, it would be good to restore database from valid backup.

Before restoring the backup you need to check validity of the backup. You can do it using below

restore verifyonly from disk='Backup location'

If it comes out clean you can restore the backup to get back the database. The data loss also depends upon the recovery model of database if it is simple you can only restore full and differential backup if it is bulk logged and Full you can restore full ,differential and log backups to get minimum data loss.

Are you able to take backup of current database with continue after error. Please see if below command works, I guess it would

backup database db_name to disk='Location' with continue_after_error

This would generate corrupt backup of database now restore this backup file ON DIFFERENT SERVER using continue after clause normal restore wont be possible see TSQL Restore for restore command.

After above is done run checkdb within transaction like below on above restored database.

Begin transaction
dbcc checkdb('db_name', repair_allow_data_loss)
--commit

After this you can have look into database how much data checkdb has deleted to recover database. Please note checkdb also removes constraints so I would not ask you to run it. Please read Books Online Document before proceeding

At last please see SQL Server errorlog and event viewer to see why database got corrupted

Best Answer

Related Solutions

Sql-server – Corrupt DB for a noob

Sql-server – CHECKDB found 2 allocation errors and 0 consistency errors not associated with any single object

Related Question