Tracking deadlocks is the easier of the two:
By default, deadlocks are not written to the error log. You can make SQL Server write deadlock information to the error log by enabling trace flags 1204 and 3605.
Write deadlock info to the SQL Server error log:
DBCC TRACEON(1204, 3605, -1)  -- -1 enables the flags globally, for all connections
Turn it off:
DBCC TRACEOFF(1204, 3605, -1)
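If you want to confirm the flags actually took effect, DBCC TRACESTATUS will show their current state:
DBCC TRACESTATUS(1204, 3605, -1)  -- -1 checks the global (server-wide) status of the flags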
See "Troubleshooting Deadlocks" for a discussion of trace flag 1204 and the output you will get when it is turned on.
https://msdn.microsoft.com/en-us/library/ms178104.aspx
Prevention is more difficult. Essentially, you have to look out for the following pattern:
Code Block 1 locks resource A, then resource B, in that order.
Code Block 2 locks resource B, then resource A, in that order.
This is the classic condition in which a deadlock can occur. If the locking of the two resources is not atomic, Code Block 1 can lock A and then be pre-empted, and Code Block 2 can lock B before Code Block 1 gets processing time back. Now you have a deadlock.
To prevent this condition, you can do something like the following:
Code Block A (pseudocode)
Lock Shared Resource Z
Lock Resource A
Lock Resource B
Unlock Shared Resource Z
...
Code Block B (pseudocode)
Lock Shared Resource Z
Lock Resource B
Lock Resource A
Unlock Shared Resource Z
...
Don't forget to unlock A and B when you are done with them.
Because both blocks take the shared guard lock Z before touching either resource, their A/B acquisitions can no longer interleave, which prevents the deadlock between Code Block A and Code Block B. In SQL Server you can approximate this guard-lock pattern with application locks, as in the sketch below.
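A rough T-SQL illustration of that guard-lock idea (the resource name 'GuardZ' and the table names are invented placeholders, not anything from your system), using sp_getapplock as the shared lock Z:

BEGIN TRAN
    -- Take the guard lock first; every block that touches A and B must do the same
    EXEC sp_getapplock @Resource = 'GuardZ', @LockMode = 'Exclusive', @LockOwner = 'Transaction'

    -- Now take the locks on A and B (placeholder updates shown here)
    UPDATE dbo.ResourceA SET Touched = 1 WHERE Id = 1
    UPDATE dbo.ResourceB SET Touched = 1 WHERE Id = 1

    -- Release the guard; the row locks on A and B are held until commit
    EXEC sp_releaseapplock @Resource = 'GuardZ', @LockOwner = 'Transaction'

    -- ... remaining work with A and B ...
COMMIT TRAN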
From a database perspective, I'm not sure how to go about preventing this situation directly, as the locks are taken by the database engine itself, i.e. row/table locks when updating data. Where I've seen the most issues occur is where you saw yours: inside a cursor. Cursors are notoriously inefficient; avoid them if at all possible and favor set-based statements, as in the sketch below.
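As an illustration only (the table and column names are made up), the row-by-row work a cursor does can often be collapsed into a single set-based statement, which takes and releases its locks in one much shorter window:

-- Instead of: DECLARE CURSOR ... FETCH ... UPDATE one row per loop iteration,
-- do the whole thing in one statement:
UPDATE dbo.Orders                 -- hypothetical table
SET    Status = 'Processed'
WHERE  Status = 'Pending'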
"It looks like something is updating the same row at both servers with different content and made merge agent crash?" This is handled by merge conflict tables, and would not cause the issues you are describing. These conflict tables are located on the publisher database, and are named like: MSMerge_conflict__.
To answer your question about what reinitialization does: by default, it takes a new snapshot of your published articles, drops and recreates the articles on the subscriber side, and then bulk loads the data from the snapshot into the subscriber articles. Since this is a production environment and those articles need to stay available on the subscriber side, it should only be used as a last resort.
What you can do is query the MSrepl_errors table on the Distribution database. This will provide you with a command_id and an xact_seqno. You can use these values as inputs into the sys.sp_browsereplcmds stored procedure. This will provide you with command text that is actually failing. Using this information, you can better understand the nature of the failure. If a particular row cannot be inserted or deleted at the subscriber, you may have to either delete the existing row (to allow the insert) or insert a dummy row (to allow the delete), respectively.
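A sketch of that sequence, run in the distribution database (the sequence number and command_id passed to sp_browsereplcmds below are placeholders; replace them with the values returned by the first query):

-- Step 1: pull the identifiers for the failing command
SELECT TOP (20) time, error_text, xact_seqno, command_id
FROM   dbo.MSrepl_errors
ORDER BY time DESC

-- Step 2: feed those values to sp_browsereplcmds to see the command text that is failing
EXEC sp_browsereplcmds
     @xact_seqno_start = '0x0000000000000000',   -- xact_seqno from step 1
     @xact_seqno_end   = '0x0000000000000000',   -- same value, to target just that transaction
     @command_id       = 1                        -- command_id from step 1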
I hope this information helps,
Matt
Best Answer
I recommend zespri's approach. Use Profiler to find the query that is taking an inordinate amount of time.
If you've had no joy with that, you can try a query against the process list that shows any process with a particularly large waittime value, filtering out system processes and any SQL that is simply sitting in a WAITFOR DELAY '00:00:05' style statement (a sketch of such a query follows below).
You can then dig a little deeper with the spid from the row that looks like it is causing the problem and inspect the SQL that is running or holding locks. The output should help you narrow down the problem query.
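A sketch of that kind of query (this is a reconstruction, not the exact query from the original post), using the legacy sysprocesses view plus DBCC INPUTBUFFER to pull the running SQL for a suspect spid:

-- User processes with a large accumulated wait, ignoring system spids
-- and sessions that are deliberately sitting in a WAITFOR DELAY
SELECT   spid, waittime, lastwaittype, blocked, loginame, hostname
FROM     sys.sysprocesses
WHERE    spid > 50                       -- system processes use the low spids
AND      waittime > 10000                -- waiting more than 10 seconds (waittime is in ms)
AND      lastwaittype <> 'WAITFOR'       -- filters out intentional WAITFOR DELAY waits
ORDER BY waittime DESC

-- Inspect the SQL the suspect spid is running/holding locks with
DBCC INPUTBUFFER(<spid>)                 -- replace <spid> with a value from the query above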