Sql-server – Backup Compression Causing Corruption In SQL 2017 TDE Database

corruptionsql serversql-server-2017transparent-data-encryption

On SQL Server 2017 (CU3), whenever I enable backup compression on one of my TDE databases, the backup process always corrupts a specific page in the database. If I run the backup without compression, it does not get corrupted. Here are the steps I've taken to verify and reproduce this issue:

  1. Run DBCC CheckDB on database "TDE_DB1"; all is good, no errors;
  2. Successfully back up database without compression; RESTORE VERIFYONLY says all is good;
  3. Successfully restore database as "TDE_DB2"; all is good, DBCC CheckDB shows no errors;
  4. Successfully back up "TDE_DB1" database WITH compression; RESTORE VERIFYONLY errors, saying "Damage to the backup set was detected";
  5. Attempt to restore database as "TDE_DB2"; errors, saying "RESTORE detected an error on page (1:92454) in database"
  6. Repeat steps 1-3; all is good;
  7. DROP "TDE_DB1" and "TDE_DB2"; Restore "TDE_DB1" from backup; all is good;
  8. Repeat steps 1-5; get same results;

To summarize: The database and regular backups seem fine, running CHECKDB on the database and VERIFYONLY on the backups do not report any errors. Backing up the database with compression seems to cause the corruption.

Below are the code samples with errors. (Note: MAXTRANSFERSIZE is required for using compression with a TDE database)

-- Good, completes with no corruption;
BACKUP DATABASE [TDE_DB1] TO DISK = N'E:\MSSQL\Backup\TDE_DB1a.bak' WITH CHECKSUM;
RESTORE VERIFYONLY FROM DISK = N'E:\MSSQL\Backup\TDE_DB1a.bak' WITH CHECKSUM;

RESTORE DATABASE [TDE_DB2]
FROM DISK = 'E:\MSSQL\Backup\TDE_DB1a.bak'
WITH MOVE 'DataFileName' to 'E:\MSSQL\Data\TDE_DB2.mdf'
,MOVE 'LogFileName' to 'F:\MSSQL\Log\TDE_DB2_log.ldf';


-- Bad, I haz corruption;
BACKUP DATABASE [TDE_DB1] TO DISK = N'E:\MSSQL\Backup\TDE_DB1b.bak' WITH CHECKSUM, COMPRESSION, MAXTRANSFERSIZE = 131072;
RESTORE VERIFYONLY FROM DISK = N'E:\MSSQL\Backup\TDE_DB1b.bak' WITH CHECKSUM;
-- ERROR
--Msg 3189, Level 16, State 1, Line 1
--Damage to the backup set was detected.
--Msg 3013, Level 16, State 1, Line 1
--VERIFY DATABASE is terminating abnormally.

RESTORE DATABASE [TDE_DB2]
FROM DISK = 'E:\MSSQL\Backup\TDE_DB1b.bak'
WITH MOVE 'DataFileName' to 'E:\MSSQL\Data\TDE_DB2.mdf'
,MOVE 'LogFileName' to 'F:\MSSQL\Log\TDE_DB2_log.ldf';
-- ERROR
--Msg 3183, Level 16, State 1, Line 7
--RESTORE detected an error on page (1:92454) in database "TDE_DB2" as read from the backup set.
--Msg 3013, Level 16, State 1, Line 7
--RESTORE DATABASE is terminating abnormally.

I then tried to check the page that is reported as having the error (It is always the same page.), but DBCC PAGE reports that the ObjectId is 0. According to this article by Paul Randal that means there was no metadata found, and one of the reasons could be that the page itself is corrupt and incorrect values were used to try to look up the metadata. His advice is to run CHECKDB, which I cannot do because the corrupted backup won't restore.

I tried the suggestions from this SO Post (Adding INIT and FORMAT to the BACKUP command) to reset the metadata, but that did not seem to change anything, I still get corruption on the compressed backup.

This only happens with one of my TDE database. I have 4 other TDE databases on this same server, and they do not have this problem. That tells me that there may be an underlying problem with this specific database. I realize that the easy solution is to just not use compression, but I feel like this may actually be an early warning to a bigger problem coming down the road.

Has anyone ever seen this before, or have any idea why compression would corrupt that page? At this point, I'm kind of at a loss as to what to do next. I considered restoring the page from an earlier backup, but I don't think that would matter since the page in the regular database seems fine.

UPDATE 1:
Below are the results from DBCC PAGE, with option 0:

DBCC execution completed. If DBCC printed error messages, contact your
system administrator.

PAGE: (1:92454)

BUFFER:

BUF @0x000002187AE55640

bpage = 0x000002184865E000 bhash = 0x0000000000000000
bpageno = (1:92454) bdbid = 8 breferences =
0 bcputicks = 563 bsampleCount = 1
bUse1 = 51429 bstat = 0x809 blog = 0x15a
bnext = 0x0000000000000000 bDirtyContext = 0x0000000000000000
bstat2 = 0x0

PAGE HEADER:

Page @0x000002184865E000

m_pageId = (1:92454) m_headerVersion = 111
m_type = 189 m_typeFlagBits = 0x2d m_level = 197
m_flagBits = 0x525e m_objId (AllocUnitId.idObj) = 788815194
m_indexId (AllocUnitId.idInd) = 515 Metadata: AllocUnitId =
145011308798541824 Metadata: PartitionId
= 0 Metadata: IndexId = -1 Metadata: ObjectId = 0 m_prevPage = (32842:1881351155) m_nextPage = (13086:-560562340)
pminlen = 36067 m_slotCnt = 8149 m_freeCnt = 51871
m_freeData = 7295 m_reservedCnt = 4810 m_lsn =
(742012401:720884976:30191) m_xactReserved = 14755
m_xdesId = (12811:1559482793) m_ghostRecCnt = 12339
m_tornBits = -1381699202 DB Frag ID = 1

Allocation Status

GAM (1:2) = ALLOCATED SGAM (1:3) = NOT ALLOCATED
PFS (1:88968) = 0x0 0_PCT_FULL DIFF (1:6) = NOT CHANGED
ML (1:7) = NOT MIN_LOGGED

If I try to run DBCC PAGE with other options, I get the below errors:

DBCC PAGE with option 1: Msg 0, Level 11, State 0, Line 0 A severe
error occurred on the current command. The results, if any, should be
discarded.

DBCC PAGE with option 3: Msg 2514, Level 16, State 5, Line 3 A DBCC
PAGE error has occurred: Invalid page type – dump style 3 not
possible.

UPDATE 2:
Here are some of the results from the sys.dm_db_database_page_allocations DMO:

object_id = 75
index_id = 1
rowset_id = 281474981625856
allocation_unit_id = 281474981625856
allocation_unit_type = 1
allocation_unit_type_desc = IN_ROW_DATA
extent_file_id = 1
extent_page_id = 92448
allocated_page_iam_file_id = 1
allocated_page_iam_page_id = 104
allocated_page_file_id = 1
allocated_page_page_id = 92454
is_allocated = 0 is_iam_page = 0
is_mixed_page_allocation = 0

Best Answer

It looks like this issue is with databases that have had SHRINK operations run on them. In my case, I was taking a copy of one of our production databases on SQL Server 2014 (which is already encrypted with TDE), running DBCC SHRINKFILE on both the data and log files, then taking a backup and restoring it on my new SQL 2017 Server. (The reason for the shrink was to reduces the size to make transferring the backup faster.)

As a test, I restored a copy of the database that I did not run DBCC SHRINKFILE on, and it did not have the corruption problems when compressing backups.

So, to summarize, the results of my testing are as follows:

  • Normal backup/restore operations on this “shrunken” TDE database work correctly in SQL 2017
  • Compressing backups of the “shrunken” TDE database seem to cause corruption in the sys.sysmultiobjrefs table
  • Compressing backups of the regular TDE database (not having DBCC SHRINKFILE run) work correctly and do not report corruption

I do not know if this is a confirmed bug in SQL Server 2017, but I have sent my findings to Microsoft for them to look over.

So, the moral of this story is: DO NOT SHRINK YOUR DATABASES! EVER! :)