Sql-server – High number of undistributed commands after a “An existing connection was forcibly closed by the remote host” event

replicationsql serversql-server-2008-r2tcpiptransactional-replication

For the lasts days I'm getting a high undistributed commands value in the replication monitor. Restarting the Server Agent solve the problem (momentarily) but after few hours starts again. Even when there are a lot of undistributed commands, I inject a token and get low latency (00:00:09 the most high). sp_replmonitorsubscriptionpendingcmds return the same values from the Replication Monitor.

The 'Distributor To Subscriber History' shows the message "TCP Provider: An existing connection was forcibly closed by the remote host". I'm looking for network issues and I didn't find nothing, just in the Application logs are the events 14152.

There are 6 replication jobs and they fail in similar time, but not in the same moment (all are replicating to the same subscriber), I means, in a period of few minutes between them. If was a network issue, they must broke at same moment, right?

After restart the agent or the server, it's working for several hours. I didn't find any pattern in the time, sometimes broke in the morning, others in the afternoon, etc.

How can I continue debugging the problem?

note: SQL server 2008 R2

Best Answer

It's almost certainly a network/connectivity issue, even if the jobs don't fail at the same time.

Error 14152 is logged when "Agent shuts down after unsuccessfully retrying an operation (agent encounters an error such as server not available, deadlock, connection failure, or time-out failure).", and if the last error was "existing connection was forcibly closed" that suggests there were also earlier tries with the same result even if not logged.

There are many reasons this could be happening intermittently. One specific reason can be multiple NICs in the distributor having the same IP assigned. It could also be the network connection is simply overloaded at times.