MongoDB Replica Set Failover due to unreachable SECONDARY resulted in Rollback

Tags: failover, mongodb, replication

I have a 3-member replica set with 1 primary, 1 secondary, and 1 arbiter. Due to an apparent network issue, the secondary was unreachable for over 20 minutes. When it became reachable again, it was elected primary, which caused my original primary to undergo a rollback (losing some write operations).

Following are the logs from each mongod instance:

Primary (XXX.XX.XX.88:27017):

2018-05-11T06:04:30.441+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:04:42.439+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:04:52.439+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out

//********* The same "Error in heartbeat request" error repeats for over 20 minutes

2018-05-11T06:25:12.514+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:25:18.515+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:25:30.516+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:25:42.223+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:59313 #220 (7 connections now open)
2018-05-11T06:25:42.224+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.227:27017
2018-05-11T06:25:42.224+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:59315 #221 (8 connections now open)
2018-05-11T06:25:42.229+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.227:27017 is now in state SECONDARY
2018-05-11T06:25:42.330+0000 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 3
2018-05-11T06:25:42.334+0000 I REPL     [replExecDBWorker-2] transition to SECONDARY

Secondary (XXX.XX.XX.227:27017):

2018-05-11T06:25:42.200+0000 W NETWORK  [HostnameCanonicalizationWorker] Failed to obtain address information for hostname ip-XXX-XX-XX-227: Name or service not known
2018-05-11T06:25:42.200+0000 I NETWORK  [SyncSourceFeedback] Socket recv() timeout  XXX.XX.XX.88:27017
2018-05-11T06:25:42.202+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.145:50720 #449 (4 connections now open)
2018-05-11T06:25:42.203+0000 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-05-11T06:25:42.205+0000 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2018-05-11T06:25:42.209+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.88:60370 #450 (5 connections now open)
2018-05-11T06:25:42.211+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.145:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:25:42.212+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.88:27017; HostUnreachable: Connection timed out
2018-05-11T06:25:42.213+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2018-05-11T06:25:42.230+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:42.232+0000 I NETWORK  [conn420] end connection XXX.XX.XX.88:59979 (4 connections now open)
2018-05-11T06:25:42.232+0000 I NETWORK  [SyncSourceFeedback] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [XXX.XX.XX.88:27017] 
2018-05-11T06:25:42.232+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:42.232+0000 I NETWORK  [conn448] end connection XXX.XX.XX.145:33580 (3 connections now open)
2018-05-11T06:25:42.233+0000 I NETWORK  [conn424] end connection XXX.XX.XX.145:59144 (2 connections now open)
2018-05-11T06:25:42.233+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.88:27017 is now in state PRIMARY
2018-05-11T06:25:42.233+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.145:27017
2018-05-11T06:25:42.233+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.145:27017
2018-05-11T06:25:42.245+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.145:27017 is now in state ARBITER
2018-05-11T06:25:42.246+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from XXX.XX.XX.88:27017 because: candidate's data is staler than mine, resp:{ term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2018-05-11T06:25:42.254+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: network error while attempting to run command 'replSetUpdatePosition' on host 'XXX.XX.XX.88:27017' 
2018-05-11T06:25:42.256+0000 I REPL     [SyncSourceFeedback] updateUpstream failed: HostUnreachable: network error while attempting to run command 'replSetUpdatePosition' on host 'XXX.XX.XX.88:27017' , will retry
2018-05-11T06:25:42.298+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2018-05-11T06:25:42.332+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 3
2018-05-11T06:25:42.360+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
2018-05-11T06:25:42.360+0000 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to XXX.XX.XX.88:27017
2018-05-11T06:25:42.360+0000 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-05-11T06:25:42.361+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:42.379+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.88:27017 is now in state SECONDARY
2018-05-11T06:25:43.241+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted

Arbiter (XXX.XX.XX.145:27017):

2018-05-11T06:04:33.430+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:04:48.428+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit

//********* The same "Error in heartbeat request" error repeats for over 20 minutes

2018-05-11T06:25:28.458+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Operation timed out
2018-05-11T06:25:33.458+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to XXX.XX.XX.227:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2018-05-11T06:25:42.218+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:52658 #35 (3 connections now open)
2018-05-11T06:25:42.219+0000 I NETWORK  [initandlisten] connection accepted from XXX.XX.XX.227:52660 #36 (4 connections now open)
2018-05-11T06:25:43.486+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.227:27017
2018-05-11T06:25:43.487+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.227:27017 is now in state PRIMARY
2018-05-11T06:25:47.203+0000 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to XXX.XX.XX.88:27017
2018-05-11T06:25:47.203+0000 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-05-11T06:25:47.205+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to XXX.XX.XX.88:27017
2018-05-11T06:25:47.205+0000 I REPL     [ReplicationExecutor] Member XXX.XX.XX.88:27017 is now in state SECONDARY

New Replica Set Config after failover:

{
    "_id" : "rs0",
    "version" : 13,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 1,
            "host" : "XXX.XX.XX.227:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "XXX.XX.XX.88:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 3,
            "host" : "XXX.XX.XX.145:27017",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("5ae2f18614d8c4dc009d875e")
    }
}

Observations:

  1. The secondary (227) was unreachable, as seen in the logs of the Primary (88) and the Arbiter (145): "Error in heartbeat request to XXX.XX.XX.227:27017"
  2. The unreachability appears to have been caused by a network issue, as seen in the logs of the Secondary (227) itself: "Failed to obtain address information for hostname ip-XXX-XX-XX-227: Name or service not known"

Questions:

  1. What could have caused the Secondary (227) to become unreachable?
  2. Why did an unreachable secondary instance cause a rollback on a reachable primary?
  3. Is the replica set misconfigured in a way that led to the rollback during failover?
  4. Are there ways to avoid this in the future, apart from the recommendations in the MongoDB docs (https://docs.mongodb.com/manual/core/replica-set-rollbacks/)?

EDIT:
Attaching the syslog from the Secondary (227):

May 11 06:01:38 ip-XXX-XX-XX-227 dhclient: bound to XXX.XX.XX.227 -- renewal in 1400 seconds.
May 11 06:24:58 ip-XXX-XX-XX-227 dhclient: DHCPREQUEST of XXX.XX.XX.227 on eth0 to XXX.XX.XX.1 port 67 (xid=0x777d16dc)
May 11 06:24:58 ip-XXX-XX-XX-227 dhclient: DHCPACK of XXX.XX.XX.227 from XXX.XX.XX.1
May 11 06:25:42 ip-XXX-XX-XX-227 dhclient: bound to XXX.XX.XX.227 -- renewal in 1660 seconds.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.796093] INFO: task jbd2/xvda1-8:206 blocked for more than 120 seconds.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.800781]       Not tainted 3.13.0-91-generic #138-Ubuntu
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.804393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809237] jbd2/xvda1-8    D ffff88003fc13180     0   206      2 0x00000000
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809241]  ffff880037689cb0 0000000000000046 ffff88003780c800 0000000000013180
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809243]  ffff880037689fd8 0000000000013180 ffff88003780c800 ffff880037689d98
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809244]  ffff8800377128b8 ffff88003780c800 ffff88001cade100 ffff880037689d80
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809246] Call Trace:
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809253]  [<ffffffff8172dd89>] schedule+0x29/0x70
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809258]  [<ffffffff8128ef27>] jbd2_journal_commit_transaction+0x287/0x1b90
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809261]  [<ffffffff8109fe28>] ? sched_clock_cpu+0xa8/0x100
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809263]  [<ffffffff810adad0>] ? prepare_to_wait_event+0x100/0x100
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809266]  [<ffffffff81076b7b>] ? lock_timer_base.isra.35+0x2b/0x50
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809268]  [<ffffffff81077a4f>] ? try_to_del_timer_sync+0x4f/0x70
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809270]  [<ffffffff81294a7d>] kjournald2+0xbd/0x240
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809272]  [<ffffffff810adad0>] ? prepare_to_wait_event+0x100/0x100
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809273]  [<ffffffff812949c0>] ? commit_timeout+0x10/0x10
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809276]  [<ffffffff8108dc79>] kthread+0xc9/0xe0
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809278]  [<ffffffff8108dbb0>] ? kthread_create_on_node+0x1c0/0x1c0
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809281]  [<ffffffff8173a3e8>] ret_from_fork+0x58/0x90
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809283]  [<ffffffff8108dbb0>] ? kthread_create_on_node+0x1c0/0x1c0
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.809288] INFO: task mongod:6576 blocked for more than 120 seconds.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.813394]       Not tainted 3.13.0-91-generic #138-Ubuntu
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.817264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822652] mongod          D ffff88003fc13180     0  6576      1 0x00000000
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822654]  ffff88003bfb9ae0 0000000000000082 ffff88003c42e000 0000000000013180
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822656]  ffff88003bfb9fd8 0000000000013180 ffff88003c42e000 ffff880037712800
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822657]  ffff880037712870 000000000086da6a 0000000000000002 ffff88003d07c030
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822659] Call Trace:
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822662]  [<ffffffff8172dd89>] schedule+0x29/0x70
May 11 06:25:42 ip-XXX-XX-XX-227 kernel: [1279440.822665]  [<ffffffff8128be35>] wait_transaction_locked+0x75/0xa0

Best Answer

  1. It probably restarted. Check the logs on 227 between 2018-05-11T06:04 and 2018-05-11T06:25 (a quick uptime check from the shell is sketched in the first block below).

  2. Because 88 had accepted some writes while 227 was unreachable, and it then lost the election after 227 rejoined the set (a write-concern sketch that avoids losing acknowledged writes follows below):

    2018-05-11T06:25:42.246+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from XXX.XX.XX.88:27017 because: candidate's data is staler than mine, resp:{ term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }

  3. I see nothing wrong with it.

  4. Use a replica set with 3 full data-bearing members. If 145 were a proper instance, then when 227 became unreachable, 88 could still have replicated its writes to 145 (one way to convert the arbiter is sketched in the last block below).
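
For point 1, a minimal check you could run from any member of rs0 (assuming you can still connect to the set): each entry in rs.status().members reports an uptime in seconds, so an unexpectedly small value for 227 would suggest its mongod process (or the whole host) restarted.

    // Print state and uptime (in seconds) for every replica set member.
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  uptime=" + m.uptime + "s");
    });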
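
For point 2, note that the rolled-back writes on 88 had only been acknowledged with the configured default of w: 1 (see getLastErrorDefaults in the config above), so they were confirmed before reaching a majority of the set. Below is a sketch of issuing a write with a majority write concern instead; the collection name and wtimeout are made up for illustration, and the trade-off is that in this primary-secondary-arbiter layout such writes will stall or time out whenever the lone secondary is down.

    // The write is acknowledged only after it has replicated to a majority of
    // voting members, so a later failover cannot roll it back.
    db.events.insertOne(
        { ts: new Date(), note: "example document" },
        { writeConcern: { w: "majority", wtimeout: 5000 } }
    );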
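
For point 4, one possible path (a sketch, assuming the host currently running the arbiter has enough disk and memory to hold a full copy of the data): remove the arbiter, re-provision that mongod as an ordinary member, and add it back, so that all three members are data-bearing and any two of them form a write majority.

    // Run on the current primary. Between the two commands, stop the mongod on
    // 145, empty its dbpath, and start it again so it can perform an initial sync.
    rs.remove("XXX.XX.XX.145:27017");
    rs.add("XXX.XX.XX.145:27017");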