Merge Replication Gone Wrong – Snapshot Timing Out

merge-replicationsql serversql server 2014

I have a merge replication set up between one Publisher and 3 Subscribers, replicating a single database with around a 100 different tables of various sizes. It's "real-time" replication running continually 24×7, and it's worked fine for the last year or more: the volume of changes is small but constant throughout the day.

However, last night it all started to go wrong. Several tables in the database were "locked": applications couldn't access them. I did the usual reboot, and everything seemed fine, and I didn't think much of it. However, this morning the same thing was happening: vital tables were locked, and digging into sp_who2 showed that replmerg was chewing on something.

I tried to reinitialise the subscription; this seems to have made things worse: the replmerge log is telling me that the snapshot is out of date. When I try to run the snapshot agent – which I've done a few times in recent memory and it's never taken more than a couple of minutes – it gets 46% of the way through before telling getting stuck at:

[46%] Setting up the publication for filtered articles

… and eventually timing out. I've dug into the queries that are running when it's stuck, and it appears to be chewing on:

select @changes_in_tombstone = count(*) 
  from dbo.MSmerge_tombstone with (rowlock) 
  where generation = @g

I've looked at the MSMerge_tombstone table, and it's got 21 million rows in it (a 14 day retention time): I've no idea if this is good or bad.

So I don't know how to proceed: I really don't want to have to delete and recreate the publication – there are changes at the subscribers that I don't want to lose – but it appears I'm stuck in a Catch 22: I need to create a new snapshot in order to get the replication going again, but I can't create a snapshot as it times out.

Should I resign myself to recreating the publication, or is there a way to fix this?

Best Answer

Timing out is probably due to performance.
Reduce the retention period gradually and reinitialize all the subscribers. I've copied and pasted the steps to do this from Claudia Silva's technet blog below:

Evaluate the number of rows that will be deleted:

  1. Use coldate from Msmerge_genshistory to get the list of generations that will be removed
  2. With that list (generation) get the number of rows that will be deleted in other system tables

Example: Select generation from dbo.MSMerge_genhistory where coldata < ‘[current data – retention period you desire]’

Example: So, from today, the 5th of April 2016, I would like a retention of 7 days: < 29th of March 2016

With the list of generations generated, we can see how many rows of data will need to be deleted from the other system tables (MSmerge_contents / MSmerge_tombstone) as well. If the number is quite high, then we increase the datetime: (example) < 24th of March 2016

Depending on the amount of data needed to be deleted, one may need “one to many” intermediate steps before reaching the target retention desired.

The key to this being a success is estimating the effect of the deletes that needs to be executed (by default batch of 5000 rows).

http://blogs.technet.com/b/claudia_silva/archive/2009/06/22/replication-infinite-retention-period-causing-performance-issues.aspx and http://msdn.microsoft.com/en-us/library/ms178557.aspx