Error:
Replication-replication Distribution Subsystem: agent (null) scheduled
for retry. Query timeout expired.
Based on my research, this error appears when there is a lot of contention on the distribution database. I do see a distribution query running for over 10 hours. If I kill this transaction, won't this cause issues with the replication since this may be a transaction that is reading and copying data? Wouldn't killing it suspend that process and possibly cause corruption.
The publisher and subscribers are all online; these query timeout errors seem to be related to the distribution server and the distribution database contention. I'm trying to identify what's happening overall, since I have many publications that are getting timeouts and from digging into the servers, it only appears the distribution server is behind.
Best Answer
First, examine what the distribution agent saved about the error by executing select * from [dbo].[MSrepl_errors] order by id desc in the distribution database.
When it comes to poor performance in the distribution database, I've experienced several problems. Besides excessive blocking, which I'm sure you are watching, there are several maintenance related issues. First, I've had virtual log file issues slow my replication in the past. When this happened, there was a noticable impact on performance. To address this run dbcc loginfo on the distribution database. If you get over 100 records returned, it's a slight concern. If you get thousands of records returned, then it's a real problem. Fixing it includes changing the autogrowth settings for the log file to a reasonable value, shrinking the log file, then re-initializing it back to its original size. Unfortunately, the default autogrowth size for data files is 1MB, which should be increased as well.
If you don't have this database included in the index and statistics maintenance, you could experience query timeout issues. I once had a specific table in a user database used by a procedure with a recursive function and if this table got fragmented slightly, the procedure would run very long.
It is also possible to use Profiler or Extended Events to trace all of the commands that the problematic distributor executes to determine which command times out. Locate the SQL Agent job that runs the distribution agent with the issue and filter by program name using that job name.
I'm also not averse to creating new indexes in the distribution database, if necessary. I use the query below to help with this, but don't just create everything without reviewing them to see if you can consolidate.
If all of this still does not resolve the issue, then locate the job for the distribution agent and add the parameter -ErrorFile (path and filename local to the server). Monitor the growth of this file because it might grow faster than you expect, but it will contain more details about the problem.