MongoDB – In MongoDB 3.0 replication, how do elections happen when a secondary goes down?

mongodb, mongodb-3.0, replication

Situation: I have a MongoDB replica set spread over two computers.

  • One computer is a server that holds the primary node and the arbiter. This server is a live server and is always on. Its local IP used for replication is 192.168.0.4.
  • The second is a PC that the secondary node resides on, and it is on only for a few hours a day. Its local IP used for replication is 192.168.0.5.

My expectation: I wanted the live server to be the main point of data interaction for my application, regardless of the state of the PC (whether it is reachable or not, since the PC is a secondary), so I wanted to make sure that the server's node is always primary.
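
For reference, this kind of preference is usually expressed through member priorities; the following is only a rough sketch of how one might set them from the mongo shell (the member indexes are assumptions matching my config below):

// Sketch: give the server's member a high priority so it is preferred as primary.
// Member indexes (0 = server, 1 = PC) are assumptions based on the rs.config() output below.
cfg = rs.conf()
cfg.members[0].priority = 10
cfg.members[1].priority = 1
rs.reconfig(cfg)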

The following is the result of rs.config():

liveSet:PRIMARY> rs.config()
{
    "_id" : "liveSet",
    "version" : 2,
    "members" : [
        {
            "_id" : 0,
            "host" : "192.168.0.4:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 10,
            "tags" : {

            },
            "slaveDelay" : 0,
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "192.168.0.5:5051",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : 0,
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "192.168.0.4:5052",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : 0,
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatTimeoutSecs" : 10,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        }
    }
}

Also, I have set the storage engine to WiredTiger, in case that matters.
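
If it helps, the active engine can be confirmed from the mongo shell; a minimal sketch:

// Sketch: confirm which storage engine this mongod is actually running.
db.serverStatus().storageEngine   // expected to report { "name" : "wiredTiger", ... }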

What I actually get, and the problem: when I turn off the PC or kill its mongod process, the node on the server becomes a secondary.

The following is the output on the server when I killed the PC's mongod process, while connected to the primary node's shell:

liveSet:PRIMARY>
2015-11-29T10:46:29.471+0430 I NETWORK  Socket recv() errno:10053 An established connection was aborted by the software in your host machine. 127.0.0.1:27017
2015-11-29T10:46:29.473+0430 I NETWORK  SocketException: remote: 127.0.0.1:27017 error: 9001 socket exception [RECV_ERROR] server [127.0.0.1:27017]
2015-11-29T10:46:29.475+0430 I NETWORK  DBClientCursor::init call() failed
2015-11-29T10:46:29.479+0430 I NETWORK  trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2015-11-29T10:46:29.481+0430 I NETWORK  reconnect 127.0.0.1:27017 (127.0.0.1) ok
liveSet:SECONDARY>


There are two points I am doubtful about:

  1. Considering this part of the MongoDB documentation:

Replica sets use elections to determine which set member will become primary. Elections occur after initiating a replica set, and also any time the primary becomes unavailable.

An election occurs when the primary is not available (or at the time of initiating the set, but that part does not concern our case). However, the primary was always available, so why does an election happen?

  2. Considering this part of the same documentation:

If a majority of the replica set is inaccessible or unavailable, the replica set cannot accept writes and all remaining members become read-only.

Considering the phrase 'members become read-only': I have two nodes up versus one down, so this should not affect our replica set either.

Now my question: how do I keep the node on the server primary when the node on the PC is not reachable?

Update:
This is the output of rs.status().

Now this makes the behavior obvious, since the arbiter was not reachable.

liveSet:PRIMARY> rs.status()
{
    "set" : "liveSet",
    "date" : ISODate("2015-11-30T04:33:03.864Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "192.168.0.4:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 1807553,
            "optime" : Timestamp(1448796026, 1),
            "optimeDate" : ISODate("2015-11-29T11:20:26Z"),
            "electionTime" : Timestamp(1448857488, 1),
            "electionDate" : ISODate("2015-11-30T04:24:48Z"),
            "configVersion" : 2,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "192.168.0.5:5051",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 496,
            "optime" : Timestamp(1448796026, 1),
            "optimeDate" : ISODate("2015-11-29T11:20:26Z"),
            "lastHeartbeat" : ISODate("2015-11-30T04:33:03.708Z"),
            "lastHeartbeatRecv" : ISODate("2015-11-30T04:33:02.451Z"),
            "pingMs" : 1,
            "configVersion" : 2
        },
        {
            "_id" : 2,
            "name" : "192.168.0.4:5052",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "lastHeartbeat" : ISODate("2015-11-30T04:33:00.008Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "configVersion" : -1
        }
    ],
    "ok" : 1
}
liveSet:PRIMARY>

Best Answer

What happened

I had asked in the comments of the question that the OP provide the output of rs.status().

The reason for that was that the primary reverted to secondary status as soon as a single member was shut down. It was obvious that the set had lost the quorum necessary to keep or elect a primary: with three voting members configured, a majority of two must be reachable. That could only be the case if, in addition to the PC, another voting member of the original replica set was or had become unavailable.

As it turned out, the arbiter of the replica set in question was not reachable by the primary which, after the shutdown of the PC, was the only remaining member of the replica set from its own point of view. So it was not possible to reach a quorum of the configured replica set members, and the node consequently reverted to secondary state.
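
A quick way to spot this from the primary's shell is to look for members the primary considers unhealthy; a minimal sketch, using the same health/stateStr fields visible in the rs.status() output above:

// Sketch: list members the current node cannot reach (health 0).
// In the OP's case this would have shown the arbiter at 192.168.0.4:5052.
rs.status().members
    .filter(function (m) { return m.health === 0; })
    .forEach(function (m) { print(m.name + " -> " + m.stateStr); });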

How to prevent

  • Always run rs.status() after setting up a replica set.
  • Always run rs.status() when encountering problems with a replica set.
  • Always do fail tests (down to losing write capabilities, as sketched after this list) and ensure your application handles those situations gracefully (as the OP did).
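
As an example of such a fail test, here is a minimal sketch from the mongo shell; the step-down duration is arbitrary, and the shell may briefly lose and re-establish its connection afterwards:

// Sketch of a simple fail test: ask the current primary to step down
// for ~60 seconds, then inspect how the set reacts.
rs.stepDown(60)
rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr + " (health " + m.health + ")");
});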

Using these rules, you will eliminate the vast majority of problems one can face when using a replica set.

Personally, I think MongoDB Inc.'s Cloud Manager is a must for production environments, since it shows problems such as the OP's right away and has alerting built in.

Side note

Never, ever (and yes, that means no exceptions, however sound the reasons may seem to be) put an arbiter on a data-bearing node of the same replica set.

Imagine the machine hosting both the arbiter and a data-bearing node goes down.

If you have a 3-member replica set, you would no longer have a quorum of the original members; the remaining member would automatically revert to secondary, and you would lose failover capability.

In a 5-member replica set, two voting members would be eliminated at once. That is fine as long as all the others are up and running, right? Except it isn't fine: if just one more node fails, you lose your quorum again. So with only two machines down, the remaining two data-bearing nodes become more or less useless. Given the price of a virtual server today (and even the smallest ones are easily sufficient to run an arbiter), this simply does not make sense. You would be paying for four data-bearing nodes anyway and lose failover capability because you tried to save a tiny fraction of the overall cost.

With a 7-member replica set, those savings become an even tinier fraction of the total cost.

Conclusion: it is simply a bad decision, business-wise, to have an arbiter running on the same machine as a data-bearing node, even when setting aside the technical aspects.
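
If a dedicated machine (or a small virtual server) is available, moving the arbiter there is a one-liner from the primary's shell. A minimal sketch, where the host name is hypothetical and its mongod is assumed to have been started with --replSet liveSet:

// Sketch: remove the co-located arbiter and add one on its own (hypothetical) host.
rs.remove("192.168.0.4:5052")
rs.addArb("arbiter.example.local:27017")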