I have merge replication set up between one Publisher and 3 Subscribers, replicating a single database with around 100 tables of various sizes. It's "real-time" replication running continuously 24×7, and it has worked fine for the last year or more: the volume of changes is small but constant throughout the day.
However, last night it all started to go wrong. Several tables in the database were "locked": applications couldn't access them. I did the usual reboot, everything seemed fine, and I didn't think much of it. However, this morning the same thing was happening: vital tables were locked, and digging into sp_who2 showed that replmerg was chewing on something.
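For anyone diagnosing the same thing: the blocking session and the statement it is running can be found with something like the following (a sketch; the SPID 57 is hypothetical, and BlkBy is the column sp_who2 uses to show which session is blocking each one):

```sql
-- List all sessions; look at the BlkBy column to find the blocking SPID.
EXEC sp_who2;

-- For a suspect SPID (57 here is just an example), show its last submitted statement.
DBCC INPUTBUFFER(57);
```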
I tried to reinitialise the subscription; this seems to have made things worse: the replmerg log is telling me that the snapshot is out of date. When I try to run the Snapshot Agent – which I've done a few times in recent memory, and it's never taken more than a couple of minutes – it gets 46% of the way through before getting stuck at:
[46%] Setting up the publication for filtered articles
… and eventually timing out. I've dug into the queries that are running when it's stuck, and it appears to be chewing on:
select @changes_in_tombstone = count(*)
from dbo.MSmerge_tombstone with (rowlock)
where generation = @g
I've looked at the MSmerge_tombstone table, and it's got 21 million rows in it (with a 14-day retention period): I've no idea if this is good or bad.
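For reference, the sizes of the merge metadata tables and the publication's retention setting can be checked with something like this (run in the published database; sp_helpmergepublication is the documented way to see the current retention value):

```sql
-- Row counts in the merge metadata tables the Snapshot Agent has to scan.
SELECT COUNT(*) AS tombstone_rows  FROM dbo.MSmerge_tombstone;
SELECT COUNT(*) AS contents_rows   FROM dbo.MSmerge_contents;
SELECT COUNT(*) AS genhistory_rows FROM dbo.MSmerge_genhistory;

-- Publication properties, including the retention period (in days).
EXEC sp_helpmergepublication;
```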
So I don't know how to proceed: I really don't want to have to delete and recreate the publication – there are changes at the Subscribers that I don't want to lose – but I appear to be stuck in a Catch-22: I need to create a new snapshot to get replication going again, but I can't create a snapshot because it times out.
Should I resign myself to recreating the publication, or is there a way to fix this?
Best Answer
The timeout is most likely a performance problem: the Snapshot Agent is churning through a very large amount of merge metadata.
Reduce the retention period gradually and then reinitialize all the Subscribers. The steps below are copied from Claudia Silva's TechNet blog:
Evaluate the number of rows that will be deleted. For example:
SELECT generation FROM dbo.MSmerge_genhistory WHERE coldate < '[current date – retention period you desire]'
Example: from today, the 5th of April 2016, a retention of 7 days means coldate < the 29th of March 2016.
With the list of generations this produces, we can see how many rows will also need to be deleted from the other system tables (MSmerge_contents / MSmerge_tombstone). If the number is very high, push the cut-off date further back: for example, coldate < the 24th of March 2016.
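That estimate can be sketched like this, assuming a cut-off date of 29 March 2016 (the generation column is what ties the three metadata tables together):

```sql
DECLARE @cutoff datetime = '2016-03-29';  -- example cut-off: today minus desired retention

-- Generations older than the cut-off.
SELECT COUNT(*) AS old_generations
FROM dbo.MSmerge_genhistory
WHERE coldate < @cutoff;

-- Rows in the other metadata tables tied to those old generations.
SELECT COUNT(*) AS contents_rows
FROM dbo.MSmerge_contents
WHERE generation IN (SELECT generation FROM dbo.MSmerge_genhistory WHERE coldate < @cutoff);

SELECT COUNT(*) AS tombstone_rows
FROM dbo.MSmerge_tombstone
WHERE generation IN (SELECT generation FROM dbo.MSmerge_genhistory WHERE coldate < @cutoff);
```

If the counts are enormous, pick an earlier cut-off and step down in several passes, as described above.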
Depending on the amount of data to be deleted, you may need one or more intermediate steps before reaching the target retention.
The key to making this a success is estimating the effect of the deletes that need to be executed (by default they run in batches of 5,000 rows).
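The retention change and the cleanup itself can be driven with something like the following (a sketch, assuming a publication named 'MyPublication' – that name is hypothetical; sp_changemergepublication and sp_mergemetadataretentioncleanup are the documented procedures, run in the published database):

```sql
-- Step the retention down gradually, e.g. from 14 days to 10 first.
EXEC sp_changemergepublication
    @publication = N'MyPublication',   -- hypothetical publication name
    @property    = N'retention',
    @value       = N'10';

-- Trigger the metadata retention cleanup (normally done by the Merge Agent)
-- and report how many rows were removed from each metadata table.
DECLARE @gen int, @cont int, @tomb int;
EXEC sp_mergemetadataretentioncleanup
    @num_genhistory_rows = @gen  OUTPUT,
    @num_contents_rows   = @cont OUTPUT,
    @num_tombstone_rows  = @tomb OUTPUT;
SELECT @gen AS genhistory_deleted, @cont AS contents_deleted, @tomb AS tombstone_deleted;
```

Repeat, lowering the retention a few days at a time, until you reach the target, then reinitialize the Subscribers.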
http://blogs.technet.com/b/claudia_silva/archive/2009/06/22/replication-infinite-retention-period-causing-performance-issues.aspx and http://msdn.microsoft.com/en-us/library/ms178557.aspx