Mongodb – Socket recv() timeout on secondary nodes

mongodb

We started using MongoDB pretty recently. We are on version 2.6.3. We have a replica set of 4 nodes (1 Primary + 3 Secondary)

We are currently seeing some latency in replication. The Secondary nodes seem to be falling behind primary by a few minutes (2 mins to 20 minutes).

Looking at the log file of Secondary nodes, I do see socket timeouts.

2014-10-01T08:42:03.485-0400 [rsBackgroundSync] Socket recv() timeout
157.206.233.10:27017 2014-10-01T08:42:03.485-0400 [rsBackgroundSync] SocketException: remote: 157.206.233.10:27017 error: 9001 socket
exception [RECV_TIMEOUT] server [157.206.233.10:27017]
2014-10-01T08:42:03.485-0400 [rsBackgroundSync] DBClientCursor::init
call() failed 2014-10-01T08:42:03.485-0400 [rsBackgroundSync] replSet
sync source problem: 10276 DBClientBase::findN: transport error:
XSNW10K832W:27017 ns: local.oplog.rs query: {}
2014-10-01T08:42:03.485-0400 [rsBackgroundSync] replSet syncing to:
XSNW10K832W:27017 2014-10-01T08:42:23.110-0400 [rsBackgroundSync]
replset setting syncSourceFeedback to XSNW10K832W:27017

Here is what I see in the log file of Primary.

2014-10-01T08:42:23.260-0400 [conn492807] getmore local.oplog.rs
cursorid:95505391617 ntoreturn:0 keyUpdates:0 numYields:1
locks(micros) r:8846 nreturned:1 reslen:1629780 108434ms
2014-10-01T08:42:23.260-0400 [conn492807] end connection
157.206.235.17:61991 (551 connections now open) 2014-10-01T08:42:23.276-0400 [conn496810] getmore local.oplog.rs
cursorid:96066510233 ntoreturn:0 keyUpdates:0 numYields:2
locks(micros) r:4403 nreturned:1 reslen:1629780 108438ms
2014-10-01T08:42:23.276-0400 [conn496810] end connection
157.206.235.16:58038 (550 connections now open) 2014-10-01T08:42:23.276-0400 [conn496812] getmore local.oplog.rs
cursorid:95180580732 ntoreturn:0 keyUpdates:0 numYields:1
locks(micros) r:2960 nreturned:1 reslen:1629780 108440ms
2014-10-01T08:42:23.276-0400 [conn496812] end connection
157.206.235.15:60989 (549 connections now open) 2014-10-01T08:42:23.291-0400 [conn496843] end connection
157.206.235.17:55712 (548 connections now open)

Any insight on what is causing the socket timeouts and how to identify/fix the problem?

Best Answer

Just a shot in the dark. Could it be a problem with the primary servicing too many connectionss simultaneously?

enter image description here

This information comes from the mongodb 202 training.

If a primary can no longer service requests, there will be

  1. socket exceptions

  2. timeouts

  3. the over servers can no longer ping the primary possibly causing an election --maxConns limit connections to mongos can insulate mongods from getting into unpredicatble states

From MongoDB best practices

Connections MongoDB drivers implement connection pooling to facilitate efficient use of resources. Each connection consumes 1MB of RAM, so be careful to monitor the total number of connections so they do not over-whelm the available RAM and reduce the availablememory for the working set. This typically happenswhen client applicationsdo not properly close their connections, or with Java in particular, that relies ongarbage collection to close the connections.

If there are too many connections open on the primary at any one given time, secondaries, administrative, clients, etc, then your primary could also be running out of memory.

Here is an example layout, with connections coming from drivers, secondaries, to and from different processes. This can all quickly add up depending on the number of clients and machines in your infrastructure!

enter image description here