MongoDB can’t add new replica set member [rsSync] SEVERE: Got signal: 6

data synchronizationmongodbreplication

I have a replica set and I added a new member to the set. The initialSync begins for the new member and rs.status (on primary) shows STARTUP2 as status. However, after a long enough time, there's a cryptic fassert error coming on the new instance.

Log dump is as following:

2014-11-02T22:53:23.995+0000 [clientcursormon] mem (MB) res:330 virt:45842
2014-11-02T22:53:23.995+0000 [clientcursormon]  mapped (incl journal view):45038
2014-11-02T22:53:23.995+0000 [clientcursormon]  connections:27
2014-11-02T22:53:23.995+0000 [clientcursormon]  replication threads:32
2014-11-02T22:53:25.427+0000 [conn2012] end connection xx.xx.xx.xx:1201 (26 connections now open)
2014-11-02T22:53:25.433+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1200 #2014 (27 connections now open)
2014-11-02T22:53:25.436+0000 [conn2014]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-02T22:53:26.775+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1058 #2015 (28 connections now open)
2014-11-02T22:53:26.864+0000 [conn1993] end connection xx.xx.xx.xx:1059 (27 connections now open)
2014-11-02T22:53:29.090+0000 [rsSync] Socket recv() errno:110 Connection timed out xx.xx.xx.xx:27017
2014-11-02T22:53:29.096+0000 [rsSync] SocketException: remote: xx.xx.xx.xx:27017 error: 9001 socket exception [RECV_ERROR] server [168.63.252.61:27017] 
2014-11-02T22:53:29.099+0000 [rsSync] DBClientCursor::init call() failed
2014-11-02T22:53:29.307+0000 [rsSync] replSet initial sync exception: 13386 socket error for mapping query 0 attempts remaining
2014-11-02T22:53:36.113+0000 [conn2013] end connection xx.xx.xx.xx:1056 (26 connections now open)
2014-11-02T22:53:36.153+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1137 #2016 (27 connections now open)
2014-11-02T22:53:36.154+0000 [conn2016]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-02T22:53:55.541+0000 [conn2014] end connection xx.xx.xx.xx:1200 (26 connections now open)
2014-11-02T22:53:55.578+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1201 #2017 (27 connections now open)
2014-11-02T22:53:55.580+0000 [conn2017]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-02T22:53:56.861+0000 [conn2015]  authenticate db: admin { authenticate: 1, user: "root", nonce: "xxx", key: "xxx" }
2014-11-02T22:53:59.310+0000 [rsSync] Fatal Assertion 16233
2014-11-02T22:53:59.723+0000 [rsSync] 0x11c0e91 0x1163109 0x114576d 0xe84c1f 0xea770e 0xea7800 0xea7af8 0x1205829 0x7ff728cf8e9a 0x7ff72800b3fd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x114576d]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xe84c1f]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x18e) [0xea770e]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x30) [0xea7800]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0xa8) [0xea7af8]
 /usr/bin/mongod() [0x1205829]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7ff728cf8e9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff72800b3fd]
2014-11-02T22:53:59.723+0000 [rsSync] 

***aborting after fassert() failure


2014-11-02T22:53:59.728+0000 [rsSync] SEVERE: Got signal: 6 (Aborted)

.

The worst part is that when I try to re-start the mongod service, the replication begins afresh trying to reSync all the files which are already there. This seems bizarre and useless.

Can someone please help me understand what is going on?

Best Answer

In the dump you have above, I see this error

SocketException: remote: xx.xx.xx.xx:27017 error: 9001 socket exception [RECV_ERROR] server [168.63.252.61:27017] 2014-11-02T22:53:29.099+0000 [rsSync] DBClientCursor::init call() failed 2014-11-02T22:53:29.307+0000 [rsSync] replSet initial sync exception: 13386 socket error for mapping query 0 attempts remaining 2014-11-02T22:53:36.113+0000 [conn2013] end connection xx.xx.xx.xx:

Which indicates that there is a socket exception error. To me that means that the configuration of a replica is incorrect or the host definded in the replica set configuration can't be contacted (firewall, network issues, mongodb not running)

STARTUP2 indicates that threads were forked for replication. While in the STARTUP2 state, the member creates threads to handle internal replication operations, and then the member transitions to the:replstate:RECOVERING state. During initial sync, members stay in:replstate:STARTUP2 until the all data transfers and all indexes finishing building. Members in the STARTUP2 state are not eligible to vote.

Running rs.status() from the primary will show you how the primary sees the secondaries and what address (host/port) it's using to contact them. You can also run rs.status() from a secondary after running the rs.slaveOK() command.

Try connnecting from the secondary to the primary and vis-versa using the mongo client. Or you can use telnet mongo_host mongo_port_number from either server to see if both are listening to the ports you've defined.