SQL Server – Replication Breaks After DBCC CheckDB with Allow Data Loss on Primary

dbccsql servertransactional-replication

We have one way transactional replication enabled on SQL Server 2008 databases. Due to some issues we had to run DBCC checkDB, bringing the database to single user mode. When we complete the DBCC activity and enable replication again, replication won't work. We had to initialize snapshot again for replication to work.

Can you please help me understand why replication is breaking in this scenario? I understand the DBCC drops some rows and indexes in a table, but as per my understanding when we enable replication these changes should be propagated to subscriber without any issues.

When I checked with my senior team member, I was told this is due to LSN mismatch, but I am not able to understand logically.

Any help is much appreciated.

Best Answer

3 important pieces of information are missing to pinpoint exactly what went wrong in your particular scenario:

  1. What where the exact steps you took, from the moment you found a problem until the moment you discovered that replication didn't work?
  2. What was broken, what was repaired, what was lost? All of this information is available in the output of CHECKDB.
  3. What exactly do you mean by: "Replication don't work?" Did you get any errors, or are tables not the same? Is data missing?

If you face database corruption in a replication environment and you don't know exactly how to interpret the repair results OR if your publication is not very big, my advise would be to always reinitialize (if possible) or drop/recreate the publication en subscription.

The reason is that most repair operations performed by DBCC CHECKDB are not replicated to the subscribers.(see demo later on) If you have very large publications AND you know exactly how to read the DBCC CHECKDB output you might be able to manually fix things on a table or row level.

Main reasons for replication failure after publication db corruption

  1. Transaction log corruption

  2. Metadata corruption

  3. User data corruption

General setup

I'm using the following script to setup a publisher, distributor, subscriber on the same server, create a source and destination database and a table with 10k rows and a publication and subscription for this table. I'll use this setup for the next tests:

Version:

        SELECT @@VERSION

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Microsoft SQL Server 2012 (SP1) - 11.0.3128.0 (X64) 
    Dec 28 2012 20:23:12 
    Copyright (c) Microsoft Corporation
    Enterprise Edition: Core-based Licensing (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1) (Hypervisor)

SETUP SCRIPT:

--Create a source database
CREATE DATABASE [RepSource]
GO
--Create a destination database
CREATE DATABASE [RepDest]
GO

USE [RepSource]
GO

--Create a table that we are going to publish
CREATE TABLE [dbo].[RepUserTable] 
(
    [ID] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [TestString] [varchar](4000) NOT NULL DEFAULT (replicate('A',(4000))),
    CONSTRAINT [PK_RepUserTable] PRIMARY KEY CLUSTERED ([ID] ASC)
) ON [PRIMARY]
GO


--fill the table with 10k rows
SET NOCOUNT ON
INSERT INTO dbo.RepUserTable DEFAULT VALUES
GO 10000



--configure distribution
USE master
GO
-- declare local server as distributor
exec sp_adddistributor @distributor = @@SERVERNAME,  @heartbeat_interval= 10, @password = N'Dis@Password'
GO

--install local distribution db
exec sp_adddistributiondb   @database = N'distribution'
GO

--find the default backup directory to use as a working directory
DECLARE @path NVARCHAR(4000)
EXEC master.dbo.xp_instance_regread
            N'HKEY_LOCAL_MACHINE',
            N'Software\Microsoft\MSSQLServer\MSSQLServer',N'BackupDirectory',
            @path OUTPUT, 
            'no_output'

--Make this server a publisher
exec sp_adddistpublisher    @publisher = @@SERVERNAME,
                            @distribution_db = N'distribution', 
                            @security_mode = 1,
                            @working_directory = @path, --Make sure the sql agent account has enough permission in this dir.
                            @trusted = N'false',
                            @thirdparty_flag = 0, 
                            @publisher_type = N'MSSQLSERVER'

GO

--Make this server the subscriber as well
exec sp_addsubscriber       @subscriber = @@SERVERNAME, 
                            @type = 0,
                            @description = N'Subscriber',
                            @security_mode=1

GO


--Adding a publication
use [RepSource]
exec sp_replicationdboption @dbname = N'RepSource', @optname = N'publish', @value = N'true'
GO

exec sp_addpublication @publication = N'TestPub2', @description = N'Transactional publication of database.', @sync_method = N'concurrent', @retention = 0, @allow_push = N'true', @allow_pull = N'true', @allow_anonymous = N'true', @enabled_for_internet = N'false', @snapshot_in_defaultfolder = N'true', @compress_snapshot = N'false', @ftp_port = 21, @allow_subscription_copy = N'false', @add_to_active_directory = N'false', @repl_freq = N'continuous', @status = N'active', @independent_agent = N'true', @immediate_sync = N'true', @allow_sync_tran = N'false', @allow_queued_tran = N'false', @allow_dts = N'false', @replicate_ddl = 1, @allow_initialize_from_backup = N'false', @enabled_for_p2p = N'false', @enabled_for_het_sub = N'false'
GO
exec sp_addpublication_snapshot @publication = N'TestPub2', @frequency_type = 1, @frequency_interval = 1, @frequency_relative_interval = 1, @frequency_recurrence_factor = 0, @frequency_subday = 8, @frequency_subday_interval = 1, @active_start_time_of_day = 0, @active_end_time_of_day = 235959, @active_start_date = 0, @active_end_date = 0, @job_login = null, @job_password = null, @publisher_security_mode = 1
GO
exec sp_addarticle @publication = N'TestPub2', @article = N'RepUserTable', @source_owner = N'dbo', @source_object = N'RepUserTable', @type = N'logbased', @description = null, @creation_script = null, @pre_creation_cmd = N'drop', @schema_option = 0x000000000803509F, @identityrangemanagementoption = N'manual', @destination_table = N'RepUserTable', @destination_owner = N'dbo', @vertical_partition = N'false', @ins_cmd = N'CALL sp_MSins_dboRepUserTable', @del_cmd = N'CALL sp_MSdel_dboRepUserTable', @upd_cmd = N'SCALL sp_MSupd_dboRepUserTable'
GO

--Add a subscription
use [RepSource]
exec sp_addsubscription @publication = N'TestPub2', @subscriber = @@servername, @destination_db = N'RepDest', @subscription_type = N'Push', @sync_type = N'automatic', @article = N'all', @update_mode = N'read only', @subscriber_type = 0
exec sp_addpushsubscription_agent @publication = N'TestPub2', @subscriber = @@servername, @subscriber_db = N'RepDest', @job_login = null, @job_password = null, @subscriber_security_mode = 1, @frequency_type = 64, @frequency_interval = 0, @frequency_relative_interval = 0, @frequency_recurrence_factor = 0, @frequency_subday = 0, @frequency_subday_interval = 0, @active_start_time_of_day = 0, @active_end_time_of_day = 235959, @active_start_date = 20140107, @active_end_date = 99991231, @enabled_for_syncmgr = N'False', @dts_package_location = N'Distributor'
GO

After running the script, start the snapshot agent and you should be done.

Transaction Log corruption

When you hit transaction log corruption and you rebuild the log file by using DBCC CHECKDB (RepSource,REPAIR_ALLOW_DATA_LOSS) in Emergency mode you'll see that the publication and subscription are gone. You have to recreate these.

  1. Stop SQL Server
  2. rename the transaction log file
  3. start SQL Server and SQL Agent Service

Confirm that your database is now corrupt:

SELECT state_desc 
FROM sys.databases
WHERE name='RepSource'

OUTPUT:
state_desc
------------------------------------------------------------
RECOVERY_PENDING


--try getting the db online
ALTER DATABASE [RepSource] SET ONLINE

OUTPUT:
File activation failure. The physical file name "e:\SQLServer\Log\RepSource_log.ldf" may be incorrect.
The log cannot be rebuilt because there were open transactions/users when the database was shutdown, no checkpoint occurred to the database, or the database was read-only. This error could occur if the transaction log file was manually deleted or lost due to a hardware or environment failure.

To fix this problem (without backup/restore) you would have to perform a emergency repair:

ALTER DATABASE [RepSource] SET EMERGENCY
ALTER DATABASE [RepSource] SET SINGLE_USER

DBCC CHECKDB (RepSource,repair_allow_data_loss)
ALTER DATABASE [RepSource] SET MULTI_USER

OUTPUT: File activation failure. The physical file name "e:\SQLServer\Log\RepSource_log.ldf" may be incorrect. The log cannot be rebuilt because there were open transactions/users when the database was shutdown, no checkpoint occurred to the database, or the database was read-only. This error could occur if the transaction log file was manually deleted or lost due to a hardware or environment failure. Warning: The log for database 'RepSource' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

Now, if you look at the output you can see why transactional replication would no longer work:

Warning: The log for database 'RepSource' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were.

Transactional Replication is based on the transaction log and breaking the LSN Chain breaks replication.

If you look at your publication and subscription you'll see that they are both gone. You have to recreate them from scratch.

NOTE: Deleting the log file is quite big step. I don't have any tools to only corrupt a single log record in the transaction log file. I'm not sure if the publication and subscription would be dropped too in that scenario.

Metadata corruption

When you configure replication, there are several system tables created that hold data that is needed for replication to work. So called metadata. There are tables created in master, MSDB, distribution db, publishing db and subcribing db.

Personally I don't have experience with this kind of corruption but supposedly you would have to re-setup replication if any of these system tables in your publishing db gets corrupted. I just tested it strangely I was able to corrupt syspublications, sysarticles, sysarticleupdates and replication kept on going. ??

User data corruption

Now this is where it gets interesting. Because you could have inconsistencies between publisher and subscriber without you knowing.

The reason is, that if you "fix" stuff with DBCC REPAIR_ALLOW_DATA_LOSS it's fixed on a physical level. Transactional replication works with a logreader that reads all transactions in the publishing database log file. If the transaction is marked for replication, it is read and translation in a logically replication statement that is executed on the subcriber. It can deal with Statements like INSERT, DELETE, etc. But it can't deal with repair statements where for example datapages are deallocated. It's not that the repair is not logged. it is! It's just that the log reader doesn't know how to handle them.

Here is an example:

Let's find out which pages are used for the ReUserTable that we created in the RepSource db with the previous script:

--create user data corruption of usertable
use [RepSource]
Go

--find the data pages of RepUserTable
DBCC IND ('RepSource','RepUserTable',1)

OUTPUT: output of DBCC IND

We'll take page 296 for this example. It's a data page (Pagetype=1). Let's look at which 2 records are on this page. (Each record is 4015 Bytes so 2 records per page.)

DBCC TRACEON(3604)
DBCC PAGE('RepSource',1,296,3)

OUTPUT

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

ID = 3   

.......

Slot 1 Column 1 Offset 0x4 Length 4 Length (physical) 4

ID = 4   

So on this page we have the records with ID 3 and ID 4.

If I try and select the rows AFTER I have corrupted page 296 I get a checksum error:

SELECT * FROM RepUserTable WHERE ID=3


Msg 824, Level 24, State 2, Line 2
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xb5aca98c; actual: 0xb58ea98c). It occurred during a read of page (1:296) in database ID 18 at offset 0x00000000250000 in file 'e:\SQLServer\Data\RepSource.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Since this is corruption on the clustered index I can only fix this with REPAIR_ALLOW_DATA_LOSS

--RUN DBCC CHECKDB
DBCC CHECKDB('RepSource')

OUTPUT:
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 389576426, index ID 1, partition ID 72057594039435264, alloc unit ID 72057594043760640 (type In-row data), page (1:296). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
Msg 8928, Level 16, State 1, Line 1
Object ID 389576426, index ID 1, partition ID 72057594039435264, alloc unit ID 72057594043760640 (type In-row data): Page (1:296) could not be processed.  See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 389576426, index ID 1, partition ID 72057594039435264, alloc unit ID 72057594043760640 (type In-row data). Page (1:296) was not seen in the scan although its parent (1:295) and previous (1:293) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 389576426, index ID 1, partition ID 72057594039435264, alloc unit ID 72057594043760640 (type In-row data). Page (1:297) is missing a reference from previous page (1:296). Possible chain linkage problem.

.......
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (RepSource).

Let's do that:

    ALTER DATABASE [RepSource] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    GO
    DBCC CHECKDB (RepSource,repair_allow_data_loss)
    ALTER DATABASE [RepSource] SET MULTI_USER
OUTPUT    
Repair: The Clustered index successfully rebuilt for the object "dbo.RepUserTable" in database "RepSource".
    Repair: The page (1:296) has been deallocated from object ID 389576426, index ID 1, partition ID 72057594042056704, alloc unit ID 72057594046840832 (type In-row data).
    The error has been repaired.
    There are 9998 rows in 4999 pages for object "RepUserTable".

Now here is the important info: Repair: The page (1:296) has been deallocated from object ID 389576426

The repair works by rebuilding the clustered index on a new set of pages and it simply skips the corrupted page. It fixes all references in allocation pages and linked pages. After the rebuild, it is as if the page was never there. (This is all logged.) The repair can't work ona row level since I corrupted the header and DBCC CHECKDB simply doesn't know what is on the page.

the result: There are 9998 rows in 4999 pages for object "RepUserTable".

1 page gone and 2 records gone. The logreader agent has no clue on how many rows where on that page. so it has no idea on how to replicate this to the subscriber.

If you now restart the log agent. You'll see that Replication doesn't fail.. However:

On the source we don't have record 3 and 4 but they are still present on the subscriber:

Select * from RepUserTable where ID IN (3,4)
/*
ID          TestString
----------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(0 row(s) affected)
*/

--INSERT record into source table to show that replication is still working
INSERT INTO RepUserTable DEFAULT VALUES

Select MAX(ID) from RepUserTable
/*
ID
-----------
10000
*/
Select MAX(ID) as ID from RepDest.dbo.RepuserTable

/*
ID
-----------
10001
*/

Select * from RepDest.dbo.RepuserTable WHERE ID=3
/*
ID          TestString
----------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3           AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
*/

--You now have Inconsistent data

If you would now re insert record 3 and 4 on the source database you would get a PK violation error when these rows are replicated to the subscriber. That won't work. And I'm assuming that this is what you mean by "Replication don't work"

But more important. If you wouldn't do that, you wouldn't notice anything. But you would have more records on the subscriber.