Mongodb – Connectivity issues in replication process

mongodbreplication

I've been trying to add a new node to a replica set without success due to these errors:

2019-08-12T06:48:15.031+0200 I ASIO [NetworkInterfaceASIO-RS-0]
Ending connection to host mongo03:27017 due to bad connection status;
1 connections to that host remain open 2019-08-12T06:48:15.031+0200 I
REPL [replication-224] Restarting oplog query due to error:
NetworkInterfaceExceededTimeLimit: error in fetcher batch callback:
Operation timed out. Last fetched optime (with hash): { ts:
Timestamp(1565585283, 1317), t: 9 }[1741433982986566690]. Restarts
remaining: 10 2019-08-12T06:48:15.032+0200 I REPL
[replication-224] Scheduled new oplog query Fetcher source:
mongo03:27017 database: local query: { find: "oplog.rs", filter: { ts:
{ $gte: Timestamp(1565585283, 1317) } }, tailable: true, oplogReplay:
true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9,
readConcern: { afterOpTime: { ts: Timestamp(1565585283, 1317), t: 9 }
} } query metadata: { $replData: 1, $oplogQueryData: 1,
$readPreference: { mode: "secondaryPreferred" } } active: 1
findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting
down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler
request: RemoteCommand 2262327 — target:mongo03:27017 db:local cmd:{
find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565585283, 1317) }
}, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS:
2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts:
Timestamp(1565585283, 1317), t: 9 } } } active: 1
callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1
retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2019-08-12T06:48:17.637+0200 I ASIO [NetworkInterfaceASIO-RS-0]
Connecting to mongo03:27017 2019-08-12T06:48:17.645+0200 I ASIO
[NetworkInterfaceASIO-RS-0] Successfully connected to mongo03:27017,
took 8ms (2 connections now open to mongo03:27017)
2019-08-12T06:49:36.961+0200 I – [repl writer worker 13]
DB.CollectionName collection clone progress: 221390504/643329790
34% (documents copied) 2019-08-12T06:51:52.859+0200 I ASIO
[NetworkInterfaceASIO-RS-0] Ending connection to host mongo03:27017
due to bad connection status; 1 connections to that host remain open
2019-08-12T06:51:52.859+0200 I REPL [replication-229] Restarting
oplog query due to error: NetworkInterfaceExceededTimeLimit: error in
fetcher batch callback: Operation timed out. Last fetched optime (with
hash): { ts: Timestamp(1565585502, 204), t: 9 }[3285716995067894969].
Restarts remaining: 10 2019-08-12T06:51:52.859+0200 I ASIO
[NetworkInterfaceASIO-RS-0] Connecting to mongo03:27017
2019-08-12T06:51:52.859+0200 I REPL [replication-229] Scheduled
new oplog query Fetcher source: mongo03:27017 database: local query: {
find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565585502, 204) }
}, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS:
2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts:
Timestamp(1565585502, 204), t: 9 } } } query metadata: { $replData: 1,
$oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } }
active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms
shutting down?: 0 first: 1 firstCommandScheduler:
RemoteCommandRetryScheduler request: RemoteCommand 2272962 —
target:mongo03:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: {
$gte: Timestamp(1565585502, 204) } }, tailable: true, oplogReplay:
true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9,
readConcern: { afterOpTime: { ts: Timestamp(1565585502, 204), t: 9 } }
} active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0
attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis:
-1ms

I'm a little bit lost regarding how to debug it.
It's a 250GB database, the synchronization is being done through a private network 100Mb. There is an iptables in all servers but the traffic from the private network is allowed:

-A INPUT -i eth1 -p tcp -m tcp --dport 27017 -m state --state NEW,ESTABLISHED -j ACCEPT

At some point the replication process fails:

2019-08-12T02:22:37.582+0200 I REPL [replication-143] Restarting
oplog query due to error: ExceededTimeLimit: error in fetcher batch
callback: Executor error during find command :: caused by :: errmsg:
"oper ation exceeded time limit". Last fetched optime (with hash): {
ts: Timestamp(1565569328, 710), t: 9 }[7908135344534364729]. Restarts
remaining: 1 2019-08-12T02:22:37.582+0200 I REPL [replication-143]
Scheduled new oplog query Fetcher source: mongo03:27017 database:
local query: { find: "oplog.rs", filter: { ts: { $gte:
Timestamp(1565569328, 710) } }, tailable: true, oplogReplay: true,
awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9,
readConcern: { afterOpTime: { ts: Timestamp(1565569328, 710), t: 9 } }
} query metadata: { $replData: 1, $oplogQ ueryData: 1,
$readPreference: { mode: "secondaryPreferred" } } active: 1
findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting
down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler
re quest: RemoteCommand 1381614 — target:mongo03:27017 db:local cmd:{
find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565569328, 710) }
}, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS:
2000, ba tchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts:
Timestamp(1565569328, 710), t: 9 } } } active: 1 callbackHandle.valid:
1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl
maxAttem pts: 1 maxTimeMillis: -1ms 2019-08-12T02:22:39.612+0200 I
REPL [replication-142] Error returned from oplog query (no more
query restarts left): ExceededTimeLimit: error in fetcher batch
callback: Executor error during find command :: caused by :: errmsg:
"operation exceeded time limit" 2019-08-12T02:22:39.615+0200 I REPL
[replication-142] Finished fetching oplog during initial sync:
ExceededTimeLimit: error in fetcher batch callback: Executor error
during find command :: caused by :: errms g: "operation exceeded time
limit". Last fetched optime and hash: { ts: Timestamp(1565569328,
710), t: 9 }[7908135344534364729] 2019-08-12T02:22:43.034+0200 I REPL
[replication-142] CollectionCloner ns:DB.CollectionName finished
cloning with status: IllegalOperation: AsyncResultsMerger killed
2019-08-12T02:22:44.551+0200 W REPL [replication-142] collection
clone for 'DB.CollectionName' failed due to IllegalOperation: While
cloning collection 'DB.CollectionName' there was an error
'AsyncResultsMerger killed' 2019-08-12T02:22:44.551+0200 I REPL
[replication-142] CollectionCloner::start called, on ns:DB.CollectionName2
2019-08-12T02:22:44.551+0200 W REPL [replication-142] database
'dbName' (9 of 9) clone failed due to ShutdownInProgress: collection
cloner completed 2019-08-12T02:22:44.551+0200 I REPL
[replication-142] Finished cloning data: ShutdownInProgress:
collection cloner completed. Beginning oplog replay.
2019-08-12T02:22:44.551+0200 I REPL [replication-142] Initial sync
attempt finishing up.

And it has to start over again.

Don't know if I could configure a bigger time out to avoid this disconnections.

Thanks in advance.

Best Answer

  1. Please check your connection time out settings on your mongo servers
  2. Please try increasing the ulimit settings on your servers
  3. Please implement retry logic in your driver level so the driver can handle this

Thank you!

-Anban