MongoDB – converting a standalone to a replica set causes services to go down

mongodb, replication

I have a standalone node1 that has been running for a long time. Now I want to convert it to a replica set. I followed the steps in https://docs.mongodb.org/v2.4/tutorial/convert-standalone-to-replica-set/ to convert the standalone to a replica set.
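
(For reference, the conversion steps from that tutorial boil down to roughly the following; the dbpath and set name here are the ones from my setup:)

# shut down the standalone mongod, then restart it with a replica set name
mongod --dbpath /data/mongodb --replSet ReplicaSet0

# in the mongo shell on node1: initiate the set and check the config
> rs.initiate()
> rs.conf()

# once the prompt shows PRIMARY, add the new member
ReplicaSet0:PRIMARY> rs.add("node2")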

After I added a member in a mongo client connected to the primary, the command-line prompt changed from ReplicaSet0:PRIMARY> to ReplicaSet0:SECONDARY>. Then I found that my production services were down.

I checked Sentry (an error collection service) and found lots of errors thrown by my Ruby code:


Moped::Errors::ConnectionFailure: Could not connect to a primary node for replica set #<Moped::Cluster:28315820 @seeds=[<Moped::Node resolved_address="10.128.129.90:27017">, <Moped::Node resolved_address="10.128.130.139:27017">]>

These are my operations and the mongo shell output:

ReplicaSet0:PRIMARY> rs.add("node2")
{ "ok" : 1 }
ReplicaSet0:PRIMARY> rs.conf()
{
  "_id" : "ReplicaSet0",
  "version" : 2,
  "members" : [
    {
      "_id" : 0,
      "host" : "node1:27017"
    },
    {
      "_id" : 1,
      "host" : "node2:27017"
    }
  ]
}
ReplicaSet0:PRIMARY> rs.status()
Thu Oct 22 15:40:13.762 DBClientCursor::init call() failed
Thu Oct 22 15:40:13.763 Error: error doing query: failed at src/mongo/shell/query.js:78
Thu Oct 22 15:40:13.763 trying reconnect to 127.0.0.1:27017
Thu Oct 22 15:40:13.764 reconnect 127.0.0.1:27017 ok
ReplicaSet0:SECONDARY> 

You can see that PRIMARY became SECONDARY.
Why did this happen? I think it is what caused my services to go down. How can I avoid it? Please help me.

Update0:

mongo.conf (yeah, that is all):

dbpath=/data/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
bind_ip = 0.0.0.0
journal=true
replSet=ReplicaSet0
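
(For the set to form, node2 needs a matching replSet name in its own mongo.conf — a minimal sketch, assuming the same paths exist on that host:)

dbpath=/data/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
bind_ip = 0.0.0.0
journal=true
replSet=ReplicaSet0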

Update1: rs.status()

ReplicaSet0:SECONDARY> rs.status()
{
  "set" : "ReplicaSet0",
  "date" : ISODate("2015-10-22T07:58:14Z"),
  "myState" : 2,
  "members" : [
    {
      "_id" : 0,
      "name" : "kuankr:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 2463,
      "optime" : Timestamp(1445499598, 19),
      "optimeDate" : ISODate("2015-10-22T07:39:58Z"),
      "self" : true
    },
    {
      "_id" : 1,
      "name" : "mongo-primary:27017",
      "health" : 0,
      "state" : 8,
      "stateStr" : "(not reachable/healthy)",
      "uptime" : 0,
      "optime" : Timestamp(0, 0),
      "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
      "lastHeartbeat" : ISODate("2015-10-22T07:58:13Z"),
      "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
      "pingMs" : 0
    }
  ],
  "ok" : 1
}

Some relevant lines picked from mongodb.log:

Thu Oct 22 15:39:52.139 [conn397] replSet replSetReconfig config object parses ok, 2 members specified
Thu Oct 22 15:39:54.599 [conn397] replSet replSetReconfig [2]
Thu Oct 22 15:39:54.599 [conn397] replSet info saving a newer config version to local.system.replset
Thu Oct 22 15:39:54.607 [conn397] replSet saveConfigLocally done
Thu Oct 22 15:39:54.607 [conn397] replSet info : additive change to configuration
Thu Oct 22 15:39:54.607 [conn397] replSet replSetReconfig new config saved locally
Thu Oct 22 15:39:54.607 [conn397] command admin.$cmd command: { replSetReconfig: { _id: "ReplicaSet0", version: 2, members: [ { _id: 0, host: "kuankr:27017" }, { _id: 1.0, host: "mongo-primary" } ] } } ntoreturn:1 keyUpdates:0 locks(micros) W:8249 reslen:37 2467ms
Thu Oct 22 15:39:54.612 [rsHealthPoll] replSet member mongo-primary:27017 is up
Thu Oct 22 15:39:54.612 [rsMgr] replSet total number of votes is even - add arbiter or give one member an extra vote
Thu Oct 22 15:40:08.610 [rsHealthPoll] DBClientCursor::init call() failed
Thu Oct 22 15:40:08.750 [rsHealthPoll] replSet info mongo-primary:27017 is down (or slow to respond):
Thu Oct 22 15:40:08.750 [rsHealthPoll] replSet member mongo-primary:27017 is now in state DOWN
Thu Oct 22 15:40:08.750 [rsMgr] can't see a majority of the set, relinquishing primary
Thu Oct 22 15:40:08.750 [rsMgr] replSet relinquishing primary state
Thu Oct 22 15:40:08.750 [rsMgr] replSet SECONDARY
Thu Oct 22 15:40:08.750 [rsMgr] replSet closing client sockets after relinquishing primary
Thu Oct 22 15:40:08.751 [conn4] end connection 10.128.132.214:47738 (61 connections now open)
...
Thu Oct 22 15:40:08.755 [conn385] end connection 127.0.0.1:35975 (1 connection now open)
...
Thu Oct 22 15:40:15.895 [rsMgr] replSet info electSelf 0
Thu Oct 22 15:40:15.896 [rsMgr] replSet couldn't elect self, only received 1 votes
...
Thu Oct 22 15:40:21.897 [rsMgr] replSet info electSelf 0
Thu Oct 22 15:40:21.897 [rsMgr] replSet couldn't elect self, only received 1 votes
...
Thu Oct 22 15:40:35.897 [rsHealthPoll] DBClientCursor::init call() failed
Thu Oct 22 15:40:35.898 [rsHealthPoll] replSet info mongo-primary:27017 is down (or slow to respond):
Thu Oct 22 15:40:35.898 [rsMgr] replSet can't see a majority, will not try to elect self
...
Thu Oct 22 15:40:43.899 [rsHealthPoll] replSet member mongo-primary:27017 is up
Thu Oct 22 15:40:43.899 [rsMgr] replSet info electSelf 0
Thu Oct 22 15:40:43.900 [rsMgr] replSet couldn't elect self, only received 1 votes

Best Answer

Basically, the scenario from the comments is what happened here: you added a new host to the set (mongo-primary), and that host is not reachable from your original host (kuankr). That means you have a replica set with 2 hosts, but only one of them is healthy. When that happens you cannot satisfy the requirement for electing a primary, which is that a strict majority (>50%) of the votes is needed.

In a 2-node set, both nodes must be available and voting to elect a primary. In a 3-node set you need 2 out of 3, in a 4-node set you need 3 out of 4, in a 5-node set you need 3 out of 5, and so on.
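
In other words, the number of votes needed is floor(n/2) + 1. A quick sketch of that arithmetic, as plain JavaScript in the mongo shell:

> majority = function(n) { return Math.floor(n / 2) + 1; }
> majority(2)   // 2 -> both nodes must be up and voting
> majority(3)   // 2 -> one node can be down
> majority(4)   // 3
> majority(5)   // 3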

This is why it is always recommended to have an odd number of nodes in your set. I would recommend adding an arbiter that can be reached by your original primary, so that it can be elected again. Then, with the immediate problem solved, work out why the original primary cannot talk to the new node (the most common issues: firewall, routing, an incorrect bind IP on the new node).
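
A sketch of the arbiter route, assuming a reachable host (a placeholder name, arbiter) with an empty data directory. Note that rs.addArb() needs a primary to run against, so in a stuck state like yours the arbiter would instead be folded into the forced reconfig shown below:

# on the arbiter host: start mongod with the same set name
mongod --port 27017 --dbpath /data/arb --replSet ReplicaSet0

# in the normal (non-stuck) case, add it from the primary:
ReplicaSet0:PRIMARY> rs.addArb("arbiter:27017")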

Update based on comments:

If the helpers will not work on a secondary, you can force the addition by editing the config object directly, something like this:

> cfg = rs.conf()
// here's what my sample config member array looks like - adjust as necessary
> cfg.members 
[
    {
        "_id" : 0,
        "host" : "mongod_A.example.net:27017"
    },
    {
        "_id" : 1,
        "host" : "mongod_B.example.net:27017"
    }
]
// let's manually add an arbiter
> cfg.members[2] = {
... "_id" : 2,
... "host": "arbiter:27017",
... "arbiterOnly": true
... }
// now our cfg object looks like this
> cfg
{
    "_id" : "rs",
    "version" : 7,
    "members" : [
        {
            "_id" : 0,
            "host" : "mongod_A.example.net:27017"
        },
        {
            "_id" : 1,
            "host" : "mongod_B.example.net:27017"
        },
        {
            "_id" : 2,
            "host" : "arbiter:27017",
            "arbiter" : true
        }
    ]
}
// Finally, reconfigure with force on the secondary
> rs.reconfig(cfg, {force : true})

You can also remove the "bad" node using a similar procedure:
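
A sketch of that removal, with the same forced-reconfig caveat (this assumes the unreachable node is members[1], as in the config above):

> cfg = rs.conf()
// keep only the healthy original member
> cfg.members = [cfg.members[0]]
> rs.reconfig(cfg, {force : true})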