MongoDB problems recovering a member of the replica set

mongodb

I have a sharded database with two replica sets (RS1 and RS2), each with two servers. Yesterday the mongod instance on one member of RS2 crashed with an error. I tried to recover that member by resyncing it from the other member of the replica set, which took a long time, and when the sync finished I got the same error again:

Tue May  7 12:37:57.023 [rsSync]   Fatal Assertion 16233
0xdcf361 0xd8f0d3 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f57205f2851 0x7f571f99811d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
 /usr/bin/mongod() [0xe17cb9]
 /lib64/libpthread.so.0(+0x7851) [0x7f57205f2851]
 /lib64/libc.so.6(clone+0x6d) [0x7f571f99811d]
Tue May  7 12:37:57.155 [rsSync]

***aborting after fassert() failure


Tue May  7 12:37:57.155 Got signal: 6 (Aborted).

Tue May  7 12:37:57.159 Backtrace:
0xdcf361 0x6cf729 0x7f571f8e2920 0x7f571f8e28a5 0x7f571f8e4085 0xd8f10e 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f57205f2851 0x7f571f99811d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cf729]
 /lib64/libc.so.6(+0x32920) [0x7f571f8e2920]
 /lib64/libc.so.6(gsignal+0x35) [0x7f571f8e28a5]
 /lib64/libc.so.6(abort+0x175) [0x7f571f8e4085]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd8f10e]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
 /usr/bin/mongod() [0xe17cb9]
 /lib64/libpthread.so.0(+0x7851) [0x7f57205f2851]
 /lib64/libc.so.6(clone+0x6d) [0x7f571f99811d]

Any idea why this is happening? How can I get this server to sync and work again? My last surviving server is now running as a secondary; is there a way to make it primary for a while so I can get the data out of it?

Thanks in advance!

Best Answer

Your surviving member is running as a secondary because it cannot form a majority: in a two-member set, electing a primary needs both votes, and only one member is up. This is why you should always have an odd number of voting members in a replica set (one can be an arbiter). I recommend adding a third node as soon as things are back to normal.
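To make the majority arithmetic concrete, here is a minimal sketch in plain Python. It does not talk to MongoDB at all; the function names are made up for illustration, and it assumes every member has exactly one vote.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: strictly more than half of all voting members."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable: int) -> bool:
    """True if the members that can still see each other form a majority."""
    return reachable >= majority(voting_members)


def tolerated_failures(voting_members: int) -> int:
    """How many voting members can be down while a primary can still be elected."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} members: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

With two members the set tolerates zero failures, which matches what you are seeing. Going from two to three members (or two plus an arbiter) lets the set survive the loss of one member; going from three to four adds nothing, which is why odd sizes are recommended.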

To get the second node up and running, do the following:

  1. Shut down the remaining secondary (not strictly necessary, since it is read-only as a secondary, but safer)
  2. Copy the whole data directory (everything in dbpath) over to the other host, replacing the broken member's data
  3. Restart both nodes

One of the advantages of replica sets (over classic master/slave) is that the members are intended to be functionally identical, so you can simply use the data from one "good" node to seed any "bad" node.
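The copy step can be sketched roughly as follows. This is only an illustration under some loud assumptions: both dbpath directories are reachable from the machine running the script (in practice you would more likely use rsync or scp between the two hosts), both mongod processes are already stopped, and the `seed_dbpath` helper and the `.broken` backup suffix are names I made up. It relies on `mongod.lock` being empty after a clean shutdown, so a non-empty lock file is treated as "do not copy".

```python
import shutil
from pathlib import Path
from typing import Optional


def seed_dbpath(good_dbpath: Path, bad_dbpath: Path) -> Optional[Path]:
    """Copy the good member's data directory over the bad member's dbpath.

    Returns the path the old (broken) data was moved to, or None if the
    bad member had no data directory yet.
    """
    # Assumption: a non-empty mongod.lock means mongod is still running
    # or did not shut down cleanly, so the files may be inconsistent.
    lock = good_dbpath / "mongod.lock"
    if lock.exists() and lock.stat().st_size > 0:
        raise RuntimeError(f"{lock} is not empty; shut mongod down cleanly first")

    # Move the broken data aside instead of deleting it.
    backup = None
    if bad_dbpath.exists():
        backup = bad_dbpath.with_name(bad_dbpath.name + ".broken")
        bad_dbpath.rename(backup)

    # Copy everything in dbpath, including the local database files.
    shutil.copytree(good_dbpath, bad_dbpath)
    return backup
```

After the copy, check that the files on the target are owned by the user that runs mongod, then restart both nodes as in step 3.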