Sql-server – How to break out of error loop in transactional replication

monitoringsql serversql-server-2008-r2transactional-replication

I was running some simple tests on a publisher and subscriber on the same instance:

  1. I inserted 50,000 rows into a published article and the data was
    correctly pushed to the subscriber.
  2. I inadvertently deleted the 49,985th record on the subscriber (but
    didn't realize it at the time)
  3. I deleted 50,000 rows from the published article

As I was watching the size of the subscriber table, I noticed it would drop down in row count almost to zero and then go back to 50,000. Deletes would start running again, the row count would drop, and then it back to 50,000. This kept happening over and over.

I ran a standard trace and saw all the deletes running normally. After it was almost complete, this statement appeared:

IF @@TRANCOUNT > 0 ROLLBACK 

Then I modified the trace to incude error messages and saw this:

The row was not found at the Subscriber when applying the replicated command.

So it looks like when the distribution agent is executing all the deletes one-by-one, the process is still included in a transaction (the original DELETE was one line). If there is an error, it performs a ROLLBACK and then starts over.

My questions:

  1. Is there a standard way to break out of the loop?
  2. Would it be acceptable to recreate the missing row so the process
    can complete?
  3. What do you do if this happens on production? I guess that's not
    really a question. But this seems like a simple error could have
    some pretty serious complications unless replication is monitored
    very carefully.

Best Answer

The error is caused by the code in the custom delete stored procedure for each article:

create procedure [dbo].[sp_MSdel_dboChild]
        @pkc1 int
as
begin  
    delete [dbo].[Child]
where [ChildID] = @pkc1
if @@rowcount = 0
    if @@microsoftversion>0x07320000
        exec sp_MSreplraiserror 20598
end  

If no row is deleted, a special procedure is called and the transaction is rolled back.

In order to avoid this, I've created a post-Snapshot script that alters the all delete / update replication stored procedures and logs the error to a table instead.

CREATE procedure [dbo].[sp_MSdel_dboChild]
        @pkc1 int
as
begin  
    delete [dbo].[Child]
where [ChildID] = @pkc1 
if @@rowcount = 0
    if @@microsoftversion>0x07320000
        insert into repl_errors (database_name, table_name, pkey_col, pkey_val) 
        select dbo.fn_GetDB(APP_NAME()), 'Child', 'ChildID', @pkc1
end