I have a MongoDB cluster with 6 replica sets; 5 are fine, one is not. Each replica set has three members. Here is the rs.status() output for the unhealthy one:
{
    "set" : "rs_5",
    "date" : ISODate("2015-12-16T02:37:39Z"),
    "myState" : 5,
    "members" : [
        {
            "_id" : 0,
            "name" : "mongo_rs_5_member_1:27018",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 33600,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2015-12-16T02:37:38Z"),
            "lastHeartbeatRecv" : ISODate("2015-12-16T02:37:37Z"),
            "pingMs" : 0,
            "lastHeartbeatMessage" : "initial sync need a member to be primary or secondary to do our initial sync"
        },
        {
            "_id" : 1,
            "name" : "mongo_rs_5_member_2:27019",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 33842,
            "optime" : Timestamp(1449898728, 18),
            "optimeDate" : ISODate("2015-12-12T05:38:48Z"),
            "lastHeartbeat" : ISODate("2015-12-16T02:37:37Z"),
            "lastHeartbeatRecv" : ISODate("2015-12-16T02:37:37Z"),
            "pingMs" : 3,
            "lastHeartbeatMessage" : "still syncing, not yet to minValid optime 566bb328:3"
        },
        {
            "_id" : 2,
            "name" : "mongo_rs_5_member_3:27020",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 33845,
            "optime" : Timestamp(1449898728, 18),
            "optimeDate" : ISODate("2015-12-12T05:38:48Z"),
            "errmsg" : "still syncing, not yet to minValid optime 566bb327:1",
            "self" : true
        }
    ],
    "ok" : 1
}
In the logs, I see entries like:
Wed Dec 16 02:40:34.033 [rsMgr] replSet I don't see a primary and I can't elect myself
and
Tue Dec 15 21:41:27.686 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync
Here is rs.conf():
{
    "_id" : "rs_5",
    "version" : 125967,
    "members" : [
        {
            "_id" : 0,
            "host" : "mongo_rs_5_member_1:27018",
            "priority" : 3
        },
        {
            "_id" : 1,
            "host" : "mongo_rs_5_member_2:27019",
            "priority" : 2
        },
        {
            "_id" : 2,
            "host" : "mongo_rs_5_member_3:27020"
        }
    ]
}
It has been like this for several days. CPU and network show no real movement, indicating that nothing is happening. Obviously I'd like not to lose data. What do I need to do to get this back to a healthy PRIMARY/SECONDARY/SECONDARY replica set?
Best Answer
I was able to resolve this by breaking the mirror. Essentially, I picked one of the members, shut it down, removed its /data/local* files, started it back up, and ran rs.initiate(). At that point it was a replica set of one (itself) and therefore primary. Then, for the other two members, I shut them down, wiped their entire /data/* files, and started them back up. From the original primary member, I simply added the two fresh members with rs.add("mongo_rs_5_member_1:27018") and rs.add("mongo_rs_5_member_2:27019"). The primary then synced all the content over to the other members (many hours) and the replica set was healthy. No more errors in the relevant application.
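The procedure above can be sketched roughly as follows. This is a hedged outline, not the exact commands run: the dbpath (/data), log path, fork-style startup, and the choice of member_3 as the kept member are assumptions for illustration; only the hostnames, ports, and the rs.initiate()/rs.add() calls come from the answer. Wiping local* deletes the replica set's metadata (config and oplog) while keeping the data files; wiping everything on the other members forces a full initial sync.

```shell
# On the member being kept (assumed here to be member_3, port 27020):
# stop it, then remove ONLY the local.* files (replica set metadata),
# preserving the actual data files.
mongod --shutdown --dbpath /data
rm /data/local*

# Restart it and re-initiate a brand-new one-member replica set;
# it becomes primary immediately.
mongod --replSet rs_5 --port 27020 --dbpath /data --fork --logpath /var/log/mongod.log
mongo --port 27020 --eval 'rs.initiate()'

# On EACH of the other two members: stop, wipe all data, restart empty.
mongod --shutdown --dbpath /data
rm -rf /data/*
mongod --replSet rs_5 --port 27018 --dbpath /data --fork --logpath /var/log/mongod.log

# Back on the primary: re-add the wiped members. Each one performs a
# full initial sync from the primary, which can take many hours.
mongo --port 27020 --eval 'rs.add("mongo_rs_5_member_1:27018")'
mongo --port 27020 --eval 'rs.add("mongo_rs_5_member_2:27019")'
```

Note that the wiped members serve no data until their initial sync completes, and the single surviving member is a window with no redundancy, so this is a last resort when normal resync is stuck, as it was here.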