MongoDB – 3 Node Replica Set all became SECONDARY

mongodb, replication

All members of my 3-node replica set became SECONDARY and I'm not sure why.

The logs from each node were:

db1

2014-12-12T02:43:55.067+0000 [conn1413096] end connection 10.0.64.12:58483 (512 connections now open)
2014-12-12T02:43:55.067+0000 [initandlisten] connection accepted from 10.0.64.12:58485 #1413098 (513 connections now open)
2014-12-12T02:44:01.068+0000 [conn1413097] end connection 10.0.64.11:35195 (512 connections now open)
2014-12-12T02:44:01.069+0000 [initandlisten] connection accepted from 10.0.64.11:35197 #1413099 (513 connections now open)
2014-12-12T02:44:14.070+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12), connection attempt failed
2014-12-12T02:44:19.071+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:22.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11), connection attempt failed
2014-12-12T02:44:24.072+0000 [rsHealthPoll] replset info 10.0.64.12:27017 just heartbeated us, but our heartbeat failed: , not changing state
2014-12-12T02:44:25.073+0000 [conn1413098] end connection 10.0.64.12:58485 (512 connections now open)
2014-12-12T02:44:27.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:31.072+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:31.075+0000 [conn1413099] end connection 10.0.64.11:35197 (511 connections now open)
2014-12-12T02:44:32.074+0000 [rsHealthPoll] replSet info 10.0.64.11:27017 is down (or slow to respond):
2014-12-12T02:44:32.074+0000 [rsHealthPoll] replSet member 10.0.64.11:27017 is now in state DOWN
2014-12-12T02:44:35.873+0000 [initandlisten] connection accepted from 10.0.64.9:43513 #1413100 (512 connections now open)
2014-12-12T02:44:35.878+0000 [conn1413100]  authenticate db: admin { authenticate: 1, nonce: "xxx", user: "loguetr", key: "xxx" }
2014-12-12T02:44:36.073+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:38.626+0000 [conn1413100] end connection 10.0.64.9:43513 (511 connections now open)
2014-12-12T02:44:39.074+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:41.074+0000 [rsHealthPoll] replSet info 10.0.64.12:27017 is down (or slow to respond):
2014-12-12T02:44:41.074+0000 [rsHealthPoll] replSet member 10.0.64.12:27017 is now in state DOWN
2014-12-12T02:44:41.074+0000 [rsMgr] can't see a majority of the set, relinquishing primary
2014-12-12T02:44:41.074+0000 [rsMgr] replSet relinquishing primary state
2014-12-12T02:44:41.074+0000 [rsMgr] replSet SECONDARY
2014-12-12T02:44:41.074+0000 [rsMgr] replSet closing client sockets after relinquishing primary

db2

2014-12-12T02:44:28.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12), connection attempt failed
2014-12-12T02:44:33.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:36.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10), connection attempt failed
2014-12-12T02:44:38.077+0000 [rsHealthPoll] replSet info 10.0.64.12:27017 is down (or slow to respond): 
2014-12-12T02:44:38.077+0000 [rsHealthPoll] replSet member 10.0.64.12:27017 is now in state DOWN
2014-12-12T02:44:41.078+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10) failed, connection attempt failed
2014-12-12T02:44:41.088+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: 10.0.64.10:27017
2014-12-12T02:44:41.088+0000 [rsBackgroundSync] replSet syncing to: 10.0.64.10:27017
2014-12-12T02:44:43.145+0000 [initandlisten] connection accepted from 10.0.0.11:40772 #56196 (7 connections now open)
2014-12-12T02:44:45.078+0000 [rsHealthPoll] couldn't connect to 10.0.64.12:27017: couldn't connect to server 10.0.64.12:27017 (10.0.64.12) failed, connection attempt failed
2014-12-12T02:44:46.079+0000 [rsHealthPoll] replSet info 10.0.64.10:27017 is down (or slow to respond): 
2014-12-12T02:44:46.079+0000 [rsHealthPoll] replSet member 10.0.64.10:27017 is now in state DOWN
2014-12-12T02:44:46.079+0000 [rsMgr] replSet can't see a majority, will not try to elect self

db3

2014-12-12T02:44:20.075+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11), connection attempt failed
2014-12-12T02:44:23.077+0000 [conn55973] end connection 10.0.64.11:50146 (6 connections now open)
2014-12-12T02:44:25.076+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:30.077+0000 [rsHealthPoll] replSet info 10.0.64.11:27017 is down (or slow to respond): 
2014-12-12T02:44:30.077+0000 [rsHealthPoll] replSet member 10.0.64.11:27017 is now in state DOWN
2014-12-12T02:44:30.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10), connection attempt failed
2014-12-12T02:44:35.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.10:27017: couldn't connect to server 10.0.64.10:27017 (10.0.64.10) failed, connection attempt failed
2014-12-12T02:44:37.077+0000 [rsHealthPoll] couldn't connect to 10.0.64.11:27017: couldn't connect to server 10.0.64.11:27017 (10.0.64.11) failed, connection attempt failed
2014-12-12T02:44:40.079+0000 [rsHealthPoll] replSet info 10.0.64.10:27017 is down (or slow to respond): 
2014-12-12T02:44:40.079+0000 [rsHealthPoll] replSet member 10.0.64.10:27017 is now in state DOWN
2014-12-12T02:44:40.080+0000 [rsMgr] replSet can't see a majority, will not try to elect self

rs.conf()

{
    "_id" : "rs0",
    "version" : 3,
    "members" : [
        {
            "_id" : 0,
            "host" : "10.0.64.10:27017"
        },
        {
            "_id" : 1,
            "host" : "10.0.64.11:27017"
        },
        {
            "_id" : 2,
            "host" : "10.0.64.12:27017"
        }
    ]
}

rs.status()

{
"set" : "rs0",
"date" : ISODate("2014-12-12T19:17:00Z"),
"myState" : 2,
"members" : [
   {
       "_id" : 0,
       "name" : "10.0.64.10:27017",
       "health" : 0,
       "state" : 8,
       "stateStr" : "(not reachable/healthy)",
       "uptime" : 0,
       "optime" : Timestamp(0, 0),
       "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
       "lastHeartbeat" : ISODate("2014-12-12T19:16:58Z"),
       "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
       "pingMs" : 0
   },
   {
       "_id" : 1,
       "name" : "10.0.64.11:27017",
       "health" : 0,
       "state" : 8,
       "stateStr" : "(not reachable/healthy)",
       "uptime" : 0,
       "optime" : Timestamp(0, 0),
       "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
       "lastHeartbeat" : ISODate("2014-12-12T19:16:55Z"),
       "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
       "pingMs" : 0
   },
   {
       "_id" : 2,
       "name" : "10.0.64.12:27017",
       "health" : 1,
       "state" : 2,
       "stateStr" : "SECONDARY",
       "uptime" : 53226,
       "optime" : Timestamp(1418352057, 1),
       "optimeDate" : ISODate("2014-12-12T02:40:57Z"),
       "self" : true
   }
],
"ok" : 1
}

Best Answer

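This is a typical network issue. If you check your logs, all nodes lost connectivity with each other at the same time, which points to a failure at the network layer. Once no member could see a majority of the set (2 of 3), the primary (10.0.64.10) relinquished its role and neither of the other two would try to elect itself, so all three ended up SECONDARY. Your configuration is fine. I would suggest using DNS names instead of IPs, which gives you more flexibility.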
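To make the majority rule concrete, here is a minimal Python sketch of the check each member effectively performs. It is an illustration only, not MongoDB's actual election code. It assumes every member in `rs.conf()` is a voting member with one vote (the configuration above sets no overrides), and the names `can_see_majority`, `members_with_majority`, and `reachable_peers` are hypothetical.

```python
def can_see_majority(total_voting_members: int, reachable_others: int) -> bool:
    """Return True if a member, counting itself, sees a strict majority of the set."""
    visible = reachable_others + 1  # a member always counts itself
    return visible > total_voting_members // 2


def members_with_majority(reachable: dict) -> list:
    """List the hosts that can currently see a majority of the set."""
    total = len(reachable)
    return [
        host
        for host, peers in reachable.items()
        if can_see_majority(total, len(peers))
    ]


# Reachability during the incident as implied by the logs:
# every member marked both of its peers DOWN.
reachable_peers = {
    "10.0.64.10:27017": [],
    "10.0.64.11:27017": [],
    "10.0.64.12:27017": [],
}

# No host qualifies, so none of them can be or become PRIMARY.
print(members_with_majority(reachable_peers))
```

With three members, a node needs to reach at least one peer to hold or win the primary role. Had only one link failed, two nodes would still have seen each other and the set would have kept (or elected) a primary.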