Sql-server – BACKUP WITH CHECKSUM works but DBCC CHECKDB detects problems

dbcc-checkdbola-hallengrensql serversql server 2014

A server I administer recently experienced a power outage that caused issues with the SSRS ReportServerTempDB. After the outage, the nightly job that runs Ola's DatabaseIntegriyCheck started reporting problems:

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d Master 
  -Q "EXECUTE [dbo].[DatabaseIntegrityCheck] 
      @Databases = 'USER_DATABASES,SYSTEM_DATABASES', 
      @LogToTable = 'Y'" -b

Command: DBCC CHECKDB ([ReportServerTempDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130168) is missing references from parent (unknown) and previous (page (1:123239)) nodes. Possible bad root entry in system catalog.
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130169) is missing references from parent (unknown) and previous (page (1:123239)) nodes. Possible bad root entry in system catalog.
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130170) is missing references from parent (unkn…
Process Exit Code 1. The step failed.

However, our nightly backup job continued to report success every time it ran. SSRS didn't seem to have any problems either. The backup job does have multiple backup commands rolled up into one step:

BACKUP DATABASE [ReportServer] TO  DISK = N'C:\Backup-SQLServer\ReportServer.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT,  NAME = N'ReportServer-Full Database Backup', SKIP, NOREWIND, NOUNLOAD,  STATS = 10
GO
BACKUP DATABASE [ReportServerTempDB] TO  DISK = N'C:\Backup-SQLServer\ReportServerTempDB.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT,  NAME = N'ReportServerTempDB-Full Database Backup', SKIP, NOREWIND, NOUNLOAD,  STATS = 10
GO
BACKUP DATABASE [TEST-PH] TO  DISK = N'C:\Backup-SQLServer\TEST-PH.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT,  NAME = N'TEST-PH-Full Database Backup', SKIP, NOREWIND, NOUNLOAD,  STATS = 10
GO

Step two of our backup job checks the files. It ran every night and reported success each time:

RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\ReportServer.bak'
GO
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\ReportServerTempDB.bak'
GO
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\TEST-PH.bak'
GO

This went on for a few days until a convenient downtime. I resolved the issue by restoring/overwriting ReportServerTempDB with a nightly backup from prior to the outage.

Our databases' PAGE_VERIFY_OPTION is set to CHECKSUM.

What scenarios would explain this behavior?

Best Answer

sp_BlitzErik wrote in a comment (since deleted):

...is asking [about the PAGE_VERIFY_OPTION] because of this: Blitz Results: Page Verification Not Optimal. The CHECKSUM option for backups will only throw errors if you're using the correct page verification option. It can sometimes be suboptimal if your database came from an older version of SQL Server.

And it will only throw errors, even with the CHECKSUM option in use for the page verify option, if the page became corrupted after writing to disk. And I don't suppose they're necessarily asking the question just because of that post - my guess is they have the option set to CHECKSUM already.

The scenario described in the question can and does happen when a page corruption happens in memory (either because of bad memory, a memory scribbler, or a SQL Server corruption bug), and then the corrupt page is written to disk with a valid checksum.

Subsequent reads of the page, including testing of the page checksum, will succeed. However, when DBCC CHECKDB (or a hapless query) runs, the corruption will become evident. In this case, it doesn't matter whether the BACKUP and RESTORE statements use CHECKSUM; they won't detect that corruption.

This is why you cannot just rely on using CHECKSUM everywhere and not run consistency checks.

Hope this helps explain what you saw.

From the errors you posted, it looks like the root page of that clustered index got whacked somehow.