SQL Server 2012 Merge Replication – Blocking During Sync Issues

blockingmerge-replicationsql-server-2012

I have a merge replication setup between two SQL Server 2012 databases via Pull method.

There are several merge replications affecting different table sets (3 sets 2-5 tables each, actually). There are no foreign keys linking them between or to another tables in DBs.

Both subscriber and publisher are changing data in replicated tables. Usually, there are more than 800k inserts per 12-hour day on the publisher, and around 300k on the subscriber; almost no updates.

Also, I have simple filtering enabled – one bit NULL column is checked for being NULL for almost all merged tables; no joins in filters. The main database has around 80G of data in those tables, the subscriber has around 30G due to filtering. There are indexes on all that filter columns. Tables are replicated as-is, no selective columns.

Primary key ranges are increased to 10M on most tables (to be sure they will not be re-placed often).

Schema replication is enabled, but no schema changes occurs.

Merge replication retention periods is set for 2 days (resulting in 3 days of stored data – around ~3M rows in MSmerge_contents table).

Replication job scheduled to start each 15 minutes.

The problem is – sometimes, when sync job kicks in, it blocks ALL inserts/updates on ALL tables in ALL merge replications in this DB. It seems it occurs only at the end of replication process regardless of synced row count (for some runs there may be only 1-2 changed/added rows) and can last for 5-10 minutes, what is unacceptable.

I traced what procedure is executing at that times of locking and caught blocking report using SQL Server Profiler and starting it manually (it seems to be safe) :

<blocked-process-report monitorLoop="201058">
 <blocked-process>
  <process id="process38d8dd0c8" taskpriority="0" logused="1268" waitresource="OBJECT: 5:690456609:0 " waittime="5002" ownerId="4004024182" transactionname="UPDATE" lasttranstarted="2015-03-11T16:08:39.890" XDES="0x1b253649e8" lockMode="IX" schedulerid="1" kpid="5132" status="suspended" spid="84" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2015-03-11T16:08:39.880" lastbatchcompleted="2015-03-11T16:08:39.880" lastattention="2015-03-11T15:29:13.240" hostname="COMP-177" hostpid="2976" loginname="user" isolationlevel="read committed (2)" xactid="4004024182" currentdb="5" lockTimeout="4294967295" clientoption1="673316896" clientoption2="128056">
   <executionStack>
    <frame line="46" stmtstart="4398" stmtend="4936" sqlhandle="0x030005008910305b49b9150057a4000000000000000000000000000000000000000000000000000000000000"/>
    <frame line="1" stmtstart="30" sqlhandle="0x020000003f32c4168d189e55398799d8e66489e031b7fa8b0000000000000000000000000000000000000000"/>
    <frame line="1" stmtstart="30" sqlhandle="0x02000000538a8c182d9ff435ad6897538e470878728dfd940000000000000000000000000000000000000000"/>
   </executionStack>
   <inputbuf>
set nocount on;update [replicated_table] set CHECK_CI=&apos;1456&apos;,DATE_CHECK_CI=getdate(),DATE_MODIFY=getdate(),where R_ID=&apos;284598973&apos;
   </inputbuf>
  </process>
 </blocked-process>
 <blocking-process>
  <process status="running" spid="150" sbid="0" ecid="0" priority="-5" trancount="2" lastbatchstarted="2015-03-11T16:08:39.347" lastbatchcompleted="2015-03-11T16:08:21.833" lastattention="1900-01-01T00:00:00.833" clientapp="Microsoft SQL Server Management Studio - Query" hostname="COMP-120" hostpid="9460" loginname="userp" isolationlevel="read committed (2)" xactid="4004019824" currentdb="5" lockTimeout="4294967295" clientoption1="673319008" clientoption2="390168">
   <executionStack>
    <frame line="366" stmtstart="32078" stmtend="32458" sqlhandle="0x0300ff7fa885d0f933812f012ba3000001000000000000000000000000000000000000000000000000000000"/>
    <frame line="1" sqlhandle="0x010005002698da17601c849b1d00000000000000000000000000000000000000000000000000000000000000"/>
   </executionStack>
   <inputbuf>
exec sp_MSmakegeneration   </inputbuf>
  </process>
 </blocking-process>
</blocked-process-report>

sys.sp_MSmakegeneration execution blocks all other activity. As I get it – this procedure is generating list of rows to be replicated. I can understand it can run for a while – but why it prevents other inserts/updates on ALL merge-enabled tables? It does not matter which publication is synced – ALL merge-enabled tables (even from another publications in this DB) also stop working.

The waitobject waitresource="OBJECT: 5:690456609:0 " is MSmerge_genhistory table. So, it somehow blocks on that table, but why?

I tried to capture that procedure actual execution plan – I got 14k lines of XML, but found nothing suspicious there (no obvious table-scans or other heavy activity).

Replication setup is mostly default, except for generation_leveling_threshold option increased to 10000 during problem investigation. It didn't help.

SQL Server 2012 v 11.0.5058.0

Default isolation level is set to READ COMMITTED SNAPSHOT; those inserts/updates are just plain inserts/updates, without isolation level changing or starting explicit transactions.

I can successfully sync servers after work day (when nobody touches DB), it runs OK in 20-30 minutes (transferring whole day data at once) but I need continuous syncing (15-20 minutes delay is OK though).

I googled the problem – there are a lot of guys with merge replication stuck due to a lot of changed data (starting with 1M rows), but nothing similar to my problem.

Best Answer

I personally DO NOT recommend setting up merge replication under such conditions. But, as there is no way for me (other than make this work or develop my own replication solution), I spent two weeks experimenting and found a solution.

  1. Disable Precompute partitions option on Publication Properties -> Subscription Options -> Filtering window. This makes sp_MSmakegeneration lock only that particular table it's currently working on, not all merge replication affected tables. This option becomes enabled automatically if you have any row filtering. But actually, it's only intended to speed up parameterized filters (not my case). Note: turning this off will invalidate subscriptions; you must re-initialize them with fresh snapshot.
  2. Create simple job with exec dbo.sp_MSmakegeneration command and schedule it every 1 minute. Normally, it called as part of synchronization process to enumerate local changes. But, if you skip calling it for 10-15 minutes (it means 20-30k changed/inserted rows in my case), the next synchronization calling it will block all client inserts/updates into replicated table; also, there is chances you will catch deadlock during its execution, making it impossible to run it successfully until server load is reduced. This, ofcourse, depends on your particular table indexes and usage.
  3. Schedule your synchronization jobs to run frequently. I personally set it to run each 3 minutes. This reduces sync delay to 5-10 minutes. Strictly speaking, this is not mandatory, but syncing a lot of changes is quite a heavy load itself, so its easier for server to sync smaller parts.

Using this, I am able to successfully sync 3 servers, with 3-10k inserts/updates per minute on each during workday with no major delays.