MongoDB Replica Set Failover due to unreachable SECONDARY resulted in Rollback

Tags: failover, mongodb, replication

I have a 3-member replica set with 1 primary, 1 secondary, and 1 arbiter. Due to an apparent network issue, the secondary was unreachable for over 20 minutes. When it became reachable again, it was elected primary, which caused my original primary to undergo a rollback (losing some write operations).

Following are the logs from each mongod instance:

Primary (XXX.XX.XX.88:27017):

2018-05-11T06:04:30.441+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:04:42.439+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:04:52.439+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out

//********* The same "Error in heartbeat request" error repeats for over 20 minutes

2018-05-11T06:25:12.514+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:25:18.515+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:25:30.516+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:25:42.223+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:59313 #220 (7 connections now open)
2018-05-11T06:25:42.224+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.227:27017
2018-05-11T06:25:42.224+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:59315 #221 (8 connections now open)
2018-05-11T06:25:42.229+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.227:27017 is now in state SECONDARY
2018-05-11T06:25:42.330+0000 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 3
2018-05-11T06:25:42.334+0000 I REPL     [replExecDBWorker-2] transition to SECONDARY

Secondary (XXX.XX.XX.227:27017):

2018-05-11T06:25:42.200+0000 W NETWORK  [HostnameCanonicalizationWorker] Failed to obtain address information for hostname ip-XXX-XX-XX-227: Name or service not known
2018-05-11T06:25:42.200+0000 I NETWORK  [SyncSourceFeedback] Socket recv() timeout  XXX.XX.XX.88:27017
2018-05-11T06:25:42.202+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.145:50720 #449 (4 connections now open)
2018-05-11T06:25:42.203+0000 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-05-11T06:25:42.205+0000 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2018-05-11T06:25:42.209+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.88:60370 #450 (5 connections now open)
2018-05-11T06:25:42.211+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.145:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:25:42.212+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.88:27017; HostUnreachable: Connection timed out
2018-05-11T06:25:42.213+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2018-05-11T06:25:42.230+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:42.232+0000 I NETWORK  [conn420] end connection XXX.XX.XX.88:59979 (4 connections now open)
2018-05-11T06:25:42.232+0000 I NETWORK  [SyncSourceFeedback] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [XXX.XX.XX.88:27017] 
2018-05-11T06:25:42.232+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:42.232+0000 I NETWORK  [conn448] end connection XXX.XX.XX.145:33580 (3 connections now open)
2018-05-11T06:25:42.233+0000 I NETWORK  [conn424] end connection XXX.XX.XX.145:59144 (2 connections now open)
2018-05-11T06:25:42.233+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.88:27017 is now in state PRIMARY
2018-05-11T06:25:42.233+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.145:27017
2018-05-11T06:25:42.233+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.145:27017
2018-05-11T06:25:42.245+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.145:27017 is now in state ARBITER
2018-05-11T06:25:42.246+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from XXX.XX.XX.88:27017 because: candidate's data is staler than mine, resp:{ term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2018-05-11T06:25:42.254+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: network error while attempting to run command 'replSetUpdatePosition' on host 'XXX.XX.XX.88:27017' 
2018-05-11T06:25:42.256+0000 I REPL     [SyncSourceFeedback] updateUpstream failed: HostUnreachable: network error while attempting to run command 'replSetUpdatePosition' on host 'XXX.XX.XX.88:27017' , will retry
2018-05-11T06:25:42.298+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2018-05-11T06:25:42.332+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 3
2018-05-11T06:25:42.360+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
2018-05-11T06:25:42.360+0000 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to XXX.XX.XX.88:27017
2018-05-11T06:25:42.360+0000 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-05-11T06:25:42.361+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:42.379+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.88:27017 is now in state SECONDARY
2018-05-11T06:25:43.241+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted

Arbiter (XXX.XX.XX.145:27017):

2018-05-11T06:04:33.430+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:04:48.428+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit

//********* The same "Error in heartbeat request" error repeats for over 20 minutes

2018-05-11T06:25:28.458+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:25:33.458+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:25:42.218+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:52658 #35 (3 connections now open)
2018-05-11T06:25:42.219+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:52660 #36 (4 connections now open)
2018-05-11T06:25:43.486+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.227:27017
2018-05-11T06:25:43.487+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.227:27017 is now in state PRIMARY
2018-05-11T06:25:47.203+0000 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to XXX.XX.XX.88:27017
2018-05-11T06:25:47.203+0000 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-05-11T06:25:47.205+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:47.205+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.88:27017 is now in state SECONDARY

New Replica Set Config after failover:

{
    "_id" : "rs0",
    "version" : 13,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 1,
            "host" : "XXX.XX.XX.227:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "XXX.XX.XX.88:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 3,
            "host" : "XXX.XX.XX.145:27017",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("5ae2f18614d8c4dc009d875e")
    }
}

Observations:

  1. The secondary (227) was unreachable, as seen in the logs of the Primary (88) and the Arbiter (145): "Error in heartbeat request to XXX.XX.XX.227:27017"
  2. The unreachability appears to have been caused by a network issue, as seen in the logs of the Secondary (227) itself: "Failed to obtain address information for hostname ip-XXX-XX-XX-227: Name or service not known"

Questions:

  1. What could have caused the Secondary (227) to become unreachable?
  2. Why did an unreachable secondary instance cause a rollback on a reachable primary?
  3. Is the replica set misconfigured in a way that led to the rollback during failover?
  4. Are there ways to avoid this in the future, apart from the recommendations in the MongoDB docs (https://docs.mongodb.com/manual/core/replica-set-rollbacks/)?

EDIT:
Attaching the syslog from the Secondary (227):

May 11 06:01:38 ip-XXX-XX-XX-227 dhclient: bound to XXX.XX.XX.227 -- renewal in 1400 seconds.
May 11 06:24:58 ip-XXX-XX-XX-227 dhclient: DHCPREQUEST of XXX.XX.XX.227 on eth0 to XXX.XX.XX.1 port 67 (xid=0x777d16dc)
May 11 06:24:58 ip-XXX-XX-XX-227 dhclient: DHCPACK of XXX.XX.XX.227 from XXX.XX.XX.1
May 11 06:25:42 ip-XXX-XX-XX-227 dhclient: bound to XXX.XX.XX.227 -- renewal in 1660 seconds.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.796093] INFO: task jbd2/xvda1-8:206 blocked for more than 120 seconds.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.800781]       Not tainted 3.13.0-91-generic #138-Ubuntu
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.804393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809237] jbd2/xvda1-8    D ffff88003fc13180     0   206      2 0x00000000
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809241]  ffff880037689cb0 0000000000000046 ffff88003780c800 0000000000013180
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809243]  ffff880037689fd8 0000000000013180 ffff88003780c800 ffff880037689d98
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809244]  ffff8800377128b8 ffff88003780c800 ffff88001cade100 ffff880037689d80
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809246] Call Trace:
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809253]  [<ffffffff8172dd89>] schedule+0x29/0x70
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809258]  [<ffffffff8128ef27>] jbd2_journal_commit_transaction+0x287/0x1b90
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809261]  [<ffffffff8109fe28>] ? sched_clock_cpu+0xa8/0x100
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809263]  [<ffffffff810adad0>] ? prepare_to_wait_event+0x100/0x100
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809266]  [<ffffffff81076b7b>] ? lock_timer_base.isra.35+0x2b/0x50
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809268]  [<ffffffff81077a4f>] ? try_to_del_timer_sync+0x4f/0x70
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809270]  [<ffffffff81294a7d>] kjournald2+0xbd/0x240
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809272]  [<ffffffff810adad0>] ? prepare_to_wait_event+0x100/0x100
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809273]  [<ffffffff812949c0>] ? commit_timeout+0x10/0x10
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809276]  [<ffffffff8108dc79>] kthread+0xc9/0xe0
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809278]  [<ffffffff8108dbb0>] ? kthread_create_on_node+0x1c0/0x1c0
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809281]  [<ffffffff8173a3e8>] ret_from_fork+0x58/0x90
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809283]  [<ffffffff8108dbb0>] ? kthread_create_on_node+0x1c0/0x1c0
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809288] INFO: task mongod:6576 blocked for more than 120 seconds.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.813394]       Not tainted 3.13.0-91-generic #138-Ubuntu
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.817264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822652] mongod          D ffff88003fc13180     0  6576      1 0x00000000
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822654]  ffff88003bfb9ae0 0000000000000082 ffff88003c42e000 0000000000013180
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822656]  ffff88003bfb9fd8 0000000000013180 ffff88003c42e000 ffff880037712800
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822657]  ffff880037712870 000000000086da6a 0000000000000002 ffff88003d07c030
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822659] Call Trace:
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822662]  [<ffffffff8172dd89>] schedule+0x29/0x70
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822665]  [<ffffffff8128be35>] wait_transaction_locked+0x75/0xa0

Best Answer

  1. It probably restarted. Check the logs on 227 between 2018-05-11T06:04 and 2018-05-11T06:25 (a quick uptime check from the shell is sketched in the first block below).

  2. Because 88 had accepted some writes while 227 was unreachable, and it then lost the election after 227 rejoined the set (a write-concern sketch that avoids losing acknowledged writes follows below):

    2018-05-11T06:25:42.246+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from XXX.XX.XX.88:27017 because: candidate's data is staler than mine, resp:{ term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }

  3. I see nothing wrong with it.

  4. Use a replica set with 3 full data-bearing members. If 145 were a proper instance, then when 227 became unreachable, 88 could still have replicated its writes to 145 (one way to convert the arbiter is sketched in the last block below).
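
For point 1, a minimal check you could run from any member of rs0 (assuming you can still connect to the set): each entry in rs.status().members reports an uptime in seconds, so an unexpectedly small value for 227 would suggest its mongod process (or the whole host) restarted.

    // Print state and uptime (in seconds) for every replica set member.
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  uptime=" + m.uptime + "s");
    });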
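
For point 2, note that the rolled-back writes on 88 had only been acknowledged with the configured default of w: 1 (see getLastErrorDefaults in the config above), so they were confirmed before reaching a majority of the set. Below is a sketch of issuing a write with a majority write concern instead; the collection name and wtimeout are made up for illustration, and the trade-off is that in this primary-secondary-arbiter layout such writes will stall or time out whenever the lone secondary is down.

    // The write is acknowledged only after it has replicated to a majority of
    // voting members, so a later failover cannot roll it back.
    db.events.insertOne(
        { ts: new Date(), note: "example document" },
        { writeConcern: { w: "majority", wtimeout: 5000 } }
    );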
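
For point 4, one possible path (a sketch, assuming the host currently running the arbiter has enough disk and memory to hold a full copy of the data): remove the arbiter, re-provision that mongod as an ordinary member, and add it back, so that all three members are data-bearing and any two of them form a write majority.

    // Run on the current primary. Between the two commands, stop the mongod on
    // 145, empty its dbpath, and start it again so it can perform an initial sync.
    rs.remove("XXX.XX.XX.145:27017");
    rs.add("XXX.XX.XX.145:27017");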