We have one way transactional replication enabled on SQL Server 2008 databases. Due to some issues we had to run DBCC checkDB
, bringing the database to single user mode. When we complete the DBCC
activity and enable replication again, replication won't work. We had to initialize snapshot again for replication to work.
Can you please help me understand why replication is breaking in this scenario? I understand the DBCC
drops some rows and indexes in a table, but as per my understanding when we enable replication these changes should be propagated to subscriber without any issues.
When I checked with my senior team member, I was told this is due to LSN mismatch, but I am not able to understand logically.
Any help is much appreciated.
Best Answer
3 important pieces of information are missing to pinpoint exactly what went wrong in your particular scenario:
CHECKDB
.If you face database corruption in a replication environment and you don't know exactly how to interpret the repair results OR if your publication is not very big, my advise would be to always reinitialize (if possible) or drop/recreate the publication en subscription.
The reason is that most repair operations performed by
DBCC CHECKDB
are not replicated to the subscribers.(see demo later on) If you have very large publications AND you know exactly how to read theDBCC CHECKDB
output you might be able to manually fix things on a table or row level.Main reasons for replication failure after publication db corruption
Transaction log corruption
Metadata corruption
User data corruption
General setup
I'm using the following script to setup a publisher, distributor, subscriber on the same server, create a source and destination database and a table with 10k rows and a publication and subscription for this table. I'll use this setup for the next tests:
SETUP SCRIPT:
After running the script, start the snapshot agent and you should be done.
Transaction Log corruption
When you hit transaction log corruption and you rebuild the log file by using
DBCC CHECKDB (RepSource,REPAIR_ALLOW_DATA_LOSS)
in Emergency mode you'll see that the publication and subscription are gone. You have to recreate these.Confirm that your database is now corrupt:
To fix this problem (without backup/restore) you would have to perform a emergency repair:
OUTPUT: File activation failure. The physical file name "e:\SQLServer\Log\RepSource_log.ldf" may be incorrect. The log cannot be rebuilt because there were open transactions/users when the database was shutdown, no checkpoint occurred to the database, or the database was read-only. This error could occur if the transaction log file was manually deleted or lost due to a hardware or environment failure. Warning: The log for database 'RepSource' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
Now, if you look at the output you can see why transactional replication would no longer work:
Transactional Replication is based on the transaction log and breaking the LSN Chain breaks replication.
If you look at your publication and subscription you'll see that they are both gone. You have to recreate them from scratch.
NOTE: Deleting the log file is quite big step. I don't have any tools to only corrupt a single log record in the transaction log file. I'm not sure if the publication and subscription would be dropped too in that scenario.
Metadata corruption
When you configure replication, there are several system tables created that hold data that is needed for replication to work. So called metadata. There are tables created in master, MSDB, distribution db, publishing db and subcribing db.
Personally I don't have experience with this kind of corruption but supposedly you would have to re-setup replication if any of these system tables in your publishing db gets corrupted. I just tested it strangely I was able to corrupt syspublications, sysarticles, sysarticleupdates and replication kept on going. ??
User data corruption
Now this is where it gets interesting. Because you could have inconsistencies between publisher and subscriber without you knowing.
The reason is, that if you "fix" stuff with DBCC REPAIR_ALLOW_DATA_LOSS it's fixed on a physical level. Transactional replication works with a logreader that reads all transactions in the publishing database log file. If the transaction is marked for replication, it is read and translation in a logically replication statement that is executed on the subcriber. It can deal with Statements like INSERT, DELETE, etc. But it can't deal with repair statements where for example datapages are deallocated. It's not that the repair is not logged. it is! It's just that the log reader doesn't know how to handle them.
Here is an example:
Let's find out which pages are used for the ReUserTable that we created in the RepSource db with the previous script:
OUTPUT:
We'll take page 296 for this example. It's a data page (Pagetype=1). Let's look at which 2 records are on this page. (Each record is 4015 Bytes so 2 records per page.)
OUTPUT
So on this page we have the records with ID 3 and ID 4.
If I try and select the rows AFTER I have corrupted page 296 I get a checksum error:
Since this is corruption on the clustered index I can only fix this with
REPAIR_ALLOW_DATA_LOSS
Let's do that:
Now here is the important info: Repair: The page (1:296) has been deallocated from object ID 389576426
The repair works by rebuilding the clustered index on a new set of pages and it simply skips the corrupted page. It fixes all references in allocation pages and linked pages. After the rebuild, it is as if the page was never there. (This is all logged.) The repair can't work ona row level since I corrupted the header and DBCC CHECKDB simply doesn't know what is on the page.
the result: There are 9998 rows in 4999 pages for object "RepUserTable".
1 page gone and 2 records gone. The logreader agent has no clue on how many rows where on that page. so it has no idea on how to replicate this to the subscriber.
If you now restart the log agent. You'll see that Replication doesn't fail.. However:
On the source we don't have record 3 and 4 but they are still present on the subscriber:
If you would now re insert record 3 and 4 on the source database you would get a PK violation error when these rows are replicated to the subscriber. That won't work. And I'm assuming that this is what you mean by "Replication don't work"
But more important. If you wouldn't do that, you wouldn't notice anything. But you would have more records on the subscriber.