Sql-server – Merge replication – corrupt row continually trying to INSERT

merge-replicationsql serversql server 2014

We have "push" SQL replication sent up, with one publisher and three subscribers. Two of the three are working fine, but one is continually causing trouble. I went through the pain and heartache of reinitialising the subscription at the weekend, and just a few days later it's stuck again.

This time, the error I'm seeing in replmerg.log (after turning up the verbosity) is:

The merge process could not replicate one or more INSERT statements to
the 'Publisher'. A stored procedure failed to execute. Troubleshoot by
using SQL Profiler.

I've tracked this down to a single table in the database: for some reason, an obviously corrupt row is continually trying to be INSERTed into the table, with a rowguid of 00000000-0000-0000-0000-000000000000. DBCC CHECKDB shows no sign of corruption.

Here's the row:

enter image description here

If I try to get rid of the bad row on the Publisher:

delete from mytable
where rowguid like '00000000-0000-0000-0000-000000000000'

I get the error:

Msg 2601, Level 14, State 1, Procedure
MSmerge_del_590C419D78ED4A36A40292C23CF5F67B, Line 68 Cannot insert
duplicate key row in object 'dbo.MSmerge_tombstone' with unique index
'uc1MSmerge_tombstone'. The duplicate key value is (3095029,
00000000-0000-0000-0000-000000000000). The statement has been
terminated.

If I manually remove the row from MSmerge_tombstone, it comes back after the next replication cycle (which is continuous, so typically 60 seconds later).

I really don't want to have to reinitialise this subscription again. Is there any way of purging this set of corrupt transactions out of the "queue", or otherwise returning the replication back to normal?

Best Answer

I've managed to hack my way around the issue, so I'll leave this here for posterity and in case anyone else has the problem. I've no doubt this is an extremely bad idea, and will cause my whole server to implode, but it has got replication firing again.

The msmerge_contents table seems to contain the "queued" UPDATEs and INSERTs, and it's one or more of the latter that seems to be causing the problem. So I did

delete from MSmerge_contents
where tablenick=xxx

... where xxx is the integer identity of the table from the other replication tables. This cleared out a couple of thousand rows, I restarted the agent, and after a bit of processing everything was back to normal. Fortunately, the table in question didn't contain anything too vital, so I don't need to worry about integrity too much.

Again, this is undoubtedly Not The Way To Do It, but it's appeared to work for me.