I have a MongoDB cluster with 6 replica sets; 5 are fine, one is not. Each replica set has three members. Here is the rs.status() output for the unhealthy one:
{
    "set" : "rs_5",
    "date" : ISODate("2015-12-16T02:37:39Z"),
    "myState" : 5,
    "members" : [
        {
            "_id" : 0,
            "name" : "mongo_rs_5_member_1:27018",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 33600,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2015-12-16T02:37:38Z"),
            "lastHeartbeatRecv" : ISODate("2015-12-16T02:37:37Z"),
            "pingMs" : 0,
            "lastHeartbeatMessage" : "initial sync need a member to be primary or secondary to do our initial sync"
        },
        {
            "_id" : 1,
            "name" : "mongo_rs_5_member_2:27019",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 33842,
            "optime" : Timestamp(1449898728, 18),
            "optimeDate" : ISODate("2015-12-12T05:38:48Z"),
            "lastHeartbeat" : ISODate("2015-12-16T02:37:37Z"),
            "lastHeartbeatRecv" : ISODate("2015-12-16T02:37:37Z"),
            "pingMs" : 3,
            "lastHeartbeatMessage" : "still syncing, not yet to minValid optime 566bb328:3"
        },
        {
            "_id" : 2,
            "name" : "mongo_rs_5_member_3:27020",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 33845,
            "optime" : Timestamp(1449898728, 18),
            "optimeDate" : ISODate("2015-12-12T05:38:48Z"),
            "errmsg" : "still syncing, not yet to minValid optime 566bb327:1",
            "self" : true
        }
    ],
    "ok" : 1
}
In the logs, I see entries like:
Wed Dec 16 02:40:34.033 [rsMgr] replSet I don't see a primary and I can't elect myself
and
Tue Dec 15 21:41:27.686 [rsSync] replSet initial sync need a member to be primary or secondary to do our initial sync
Here is rs.conf():
{
    "_id" : "rs_5",
    "version" : 125967,
    "members" : [
        {
            "_id" : 0,
            "host" : "mongo_rs_5_member_1:27018",
            "priority" : 3
        },
        {
            "_id" : 1,
            "host" : "mongo_rs_5_member_2:27019",
            "priority" : 2
        },
        {
            "_id" : 2,
            "host" : "mongo_rs_5_member_3:27020"
        }
    ]
}
It has been like this for several days. CPU and network show no real movement, indicating that nothing is happening. Obviously I'd like not to lose data. What do I need to do to get this back to a healthy PRIMARY/SECONDARY/SECONDARY replica set?
Best Answer
I was able to resolve this by breaking the mirror. Essentially, I picked one of the members, shut it down, removed its /data/local* files, started it back up, and ran rs.initiate(). At that point it was a replica set of one (itself) and therefore primary. Then, for the other two members, I shut them down, wiped their entire /data/* files, and started them back up. From the original primary member, I simply added the two fresh members with rs.add("mongo_rs_5_member_1:27018") and rs.add("mongo_rs_5_member_2:27019"). The primary then synced all the content over to the other members (many hours) and the replica set was healthy. No more errors in the relevant application.
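The procedure above can be sketched roughly as follows. This is a hedged outline, not the exact commands run: the dbpath (/data), log path, fork-style startup, and the choice of member_3 as the kept member are assumptions for illustration; only the hostnames, ports, and the rs.initiate()/rs.add() calls come from the answer. Wiping local* deletes the replica set's metadata (config and oplog) while keeping the data files; wiping everything on the other members forces a full initial sync.

```shell
# On the member being kept (assumed here to be member_3, port 27020):
# stop it, then remove ONLY the local.* files (replica set metadata),
# preserving the actual data files.
mongod --shutdown --dbpath /data
rm /data/local*

# Restart it and re-initiate a brand-new one-member replica set;
# it becomes primary immediately.
mongod --replSet rs_5 --port 27020 --dbpath /data --fork --logpath /var/log/mongod.log
mongo --port 27020 --eval 'rs.initiate()'

# On EACH of the other two members: stop, wipe all data, restart empty.
mongod --shutdown --dbpath /data
rm -rf /data/*
mongod --replSet rs_5 --port 27018 --dbpath /data --fork --logpath /var/log/mongod.log

# Back on the primary: re-add the wiped members. Each one performs a
# full initial sync from the primary, which can take many hours.
mongo --port 27020 --eval 'rs.add("mongo_rs_5_member_1:27018")'
mongo --port 27020 --eval 'rs.add("mongo_rs_5_member_2:27019")'
```

Note that the wiped members serve no data until their initial sync completes, and the single surviving member is a window with no redundancy, so this is a last resort when normal resync is stuck, as it was here.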