Sql-server – NULL Agent Timing Out

replicationsql server

Error:

Replication-replication Distribution Subsystem: agent (null) scheduled
for retry. Query timeout expired.

Based on my research, this error appears when there is a lot of contention on the distribution database. I do see a distribution query running for over 10 hours. If I kill this transaction, won't this cause issues with the replication since this may be a transaction that is reading and copying data? Wouldn't killing it suspend that process and possibly cause corruption.

The publisher and subscribers are all online; these query timeout errors seem to be related to the distribution server and the distribution database contention. I'm trying to identify what's happening overall, since I have many publications that are getting timeouts and from digging into the servers, it only appears the distribution server is behind.

Best Answer

First, examine what the distribution agent saved about the error by executing select * from [dbo].[MSrepl_errors] order by id desc in the distribution database.

When it comes to poor performance in the distribution database, I've experienced several problems. Besides excessive blocking, which I'm sure you are watching, there are several maintenance related issues. First, I've had virtual log file issues slow my replication in the past. When this happened, there was a noticable impact on performance. To address this run dbcc loginfo on the distribution database. If you get over 100 records returned, it's a slight concern. If you get thousands of records returned, then it's a real problem. Fixing it includes changing the autogrowth settings for the log file to a reasonable value, shrinking the log file, then re-initializing it back to its original size. Unfortunately, the default autogrowth size for data files is 1MB, which should be increased as well.

If you don't have this database included in the index and statistics maintenance, you could experience query timeout issues. I once had a specific table in a user database used by a procedure with a recursive function and if this table got fragmented slightly, the procedure would run very long.

It is also possible to use Profiler or Extended Events to trace all of the commands that the problematic distributor executes to determine which command times out. Locate the SQL Agent job that runs the distribution agent with the issue and filter by program name using that job name.

I'm also not averse to creating new indexes in the distribution database, if necessary. I use the query below to help with this, but don't just create everything without reviewing them to see if you can consolidate.

    USE distribution

    SELECT so.create_date
,dm_mid.database_id AS DatabaseID
,dm_migs.avg_user_impact * (dm_migs.user_seeks + dm_migs.user_scans) Avg_Estimated_Impact
,dm_migs.last_user_seek AS Last_User_Seek
,OBJECT_NAME(dm_mid.OBJECT_ID, dm_mid.database_id) AS [TableName]
,'CREATE INDEX [IX_' + OBJECT_NAME(dm_mid.OBJECT_ID, dm_mid.database_id) + '_' + REPLACE(REPLACE(REPLACE(ISNULL(dm_mid.equality_columns, ''), ', ', '_'), '[', ''), ']', '') + CASE 
    WHEN dm_mid.equality_columns IS NOT NULL
        AND dm_mid.inequality_columns IS NOT NULL
        THEN '_'
    ELSE ''
    END + REPLACE(REPLACE(REPLACE(ISNULL(dm_mid.inequality_columns, ''), ', ', '_'), '[', ''), ']', '') + ']' + ' ON ' + dm_mid.statement + ' (' + ISNULL(dm_mid.equality_columns, '') + CASE 
    WHEN dm_mid.equality_columns IS NOT NULL
        AND dm_mid.inequality_columns IS NOT NULL
        THEN ','
    ELSE ''
    END + ISNULL(dm_mid.inequality_columns, '') + ')' + ISNULL(' INCLUDE (' + dm_mid.included_columns + ')', '') AS Create_Statement
,dm_migs.user_seeks
,dm_migs.user_scans
    FROM sys.dm_db_missing_index_groups dm_mig
    INNER JOIN sys.dm_db_missing_index_group_stats dm_migs 
ON 
dm_migs.group_handle = dm_mig.index_group_handle
    INNER JOIN sys.dm_db_missing_index_details dm_mid 
ON 
dm_mig.index_handle = dm_mid.index_handle
    JOIN sys.objects so 
ON 
so.object_id = dm_mid.OBJECT_ID
    WHERE 
dm_mid.database_ID = DB_ID()
    ORDER BY 
Avg_Estimated_Impact DESC

If all of this still does not resolve the issue, then locate the job for the distribution agent and add the parameter -ErrorFile (path and filename local to the server). Monitor the growth of this file because it might grow faster than you expect, but it will contain more details about the problem.