Mongodb Replica Member Server Changed: Status Stuck In STARTUP. Missing config

mongodbreplication

I am on mongo v3.0.x and have a replication set with 3 members, but the second member (and entire server) was unexpectedly taken down and destroyed. I have stood up a new server to use as a replacement for the the second member and need help bringing it back to a working state. I am not sure if this is the proper way to bring a member back in, even though they have the same address (mongochat02).

Do I need to do a rs.reconfig() or remove(mongochat02) and add(mongochat02) again? Or am I supposed to follow a different procedure to get this member working?

When I issue rs.status():

001-rs:PRIMARY> rs.status()
{
    "set" : "001-rs",
    "date" : ISODate("2017-04-07T19:39:23.860Z"),
    "myState" : 1,
    "term" : NumberLong(-1),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [
        {
            "_id" : 0,
            "name" : "mongochat01:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 174004,
            "optime" : Timestamp(1491593961, 4),
            "optimeDate" : ISODate("2017-04-07T19:39:21Z"),
            "lastHeartbeat" : ISODate("2017-04-07T19:39:22.386Z"),
            "lastHeartbeatRecv" : ISODate("2017-04-07T19:39:21.977Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "mongochat03:27017",
            "configVersion" : 3
        },
        {
            "_id" : 1,
            "name" : "mongochat02:27017",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2017-04-07T19:39:23.672Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "Operation timed out",
            "configVersion" : -1
        },
        {
            "_id" : 2,
            "name" : "mongochat03:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 15127641,
            "optime" : Timestamp(1491593963, 2),
            "optimeDate" : ISODate("2017-04-07T19:39:23Z"),
            "electionTime" : Timestamp(1491419961, 1),
            "electionDate" : ISODate("2017-04-05T19:19:21Z"),
            "configVersion" : 3,
            "self" : true
        }
    ],
    "ok" : 1
}

While on mongochat02, when I issue rs.status()

> rs.status()
{
    "info" : "run rs.initiate(...) if not yet done for the set",
    "ok" : 0,
    "errmsg" : "no replset config has been received",
    "code" : 94
}

Connections seem to work up until the heartbeat. In the logs, there is the following error while on mongochat02:

2017-04-06T10:42:11.831-0600 I REPL     [ReplicationExecutor] Error in heartbeat request mongochat01:27017; ExceededTimeLimit: Operation timed out
2017-04-06T10:42:11.911-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:36358 #6671 (151 connections now open)
2017-04-06T10:42:11.947-0600 I REPL     [ReplicationExecutor] Error in heartbeat request to mongochat03:27017; ExceededTimeLimit: Operation timed out

This made me question if the members are unable to communicate with one another:

  • All the members are able to ping one another
  • But mongochat02 is unable to connect to mongochat03/01 through shell

    [root@mongochat02]$ mongo --host mongochat03:27017
    MongoDB shell version: 3.2.9
    connecting to: mongochat03:27017/test
    

    2017-04-07T14:02:03.411-0600 I NETWORK [thread1] Socket recv() errno:110 Connection timed out mongochat03:27017

    2017-04-07T14:02:03.411-0600 I NETWORK [thread1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_ERROR] server [mongochat03:27017]

Mongo log from mongochat01 (currently secondary)

2017-04-11T21:03:36.264+0000 I REPL     [ReplicationExecutor] Member mongochat02:27017 is now in state STARTUP
2017-04-11T21:03:41.779+0000 I NETWORK  [initandlisten] connection accepted from 10.8.129.114:33648 #43227 (104 connections now open)
2017-04-11T21:03:47.896+0000 I NETWORK  [conn43147] end connection 10.8.129.114:33200 (103 connections now open)
2017-04-11T21:03:48.265+0000 I NETWORK  [ReplExecNetThread-2] Socket recv() timeout  10.8.129.114:27017
2017-04-11T21:03:48.265+0000 I NETWORK  [ReplExecNetThread-2] SocketException: remote: 10.8.129.114:27017 error: 9001 socket exception [RECV_TIMEOUT] server [10.8.129.114:27017]
2017-04-11T21:03:48.265+0000 I NETWORK  [ReplExecNetThread-2] DBClientCursor::init call() failed
2017-04-11T21:03:48.265+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to mongochat02:27017; Location10276 DBClientBase::findN: transport error: mongochat02:27017 ns: admin.$cmd query: { replSetHeartbeat: "001-rs", pv: 1, v: 3, from: "mongochat01:27017", fromId: 0, checkEmpty: false }
2017-04-11T21:03:50.380+0000 I REPL     [ReplicationExecutor] Member mongochat02:27017 is now in state STARTUP
2017-04-11T21:03:53.778+0000 I NETWORK  [initandlisten] connection accepted from 10.8.129.114:33654 #43228 (104 connections now open)

mongochat03 log (curent Primary)

 2017-04-11T21:15:45.955+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to mongochat02:27017; ExceededTimeLimit: Operation timed out
    2017-04-11T21:15:46.231+0000 I NETWORK  [initandlisten] connection accepted from 10.8.129.114:55282 #1148708 (108 connections now open)
    2017-04-11T21:15:48.041+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongochat02:27017
    2017-04-11T21:15:52.764+0000 I NETWORK  [conn1148596] end connection 10.8.129.114:54834 (107 connections now open)
    2017-04-11T21:15:57.956+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to mongochat02:27017; ExceededTimeLimit: Operation timed out
    2017-04-11T21:15:58.231+0000 I NETWORK  [initandlisten] connection accepted from 10.8.129.114:55288 #1148709 (108 connections now open)
    2017-04-11T21:15:59.902+0000 I NETWORK  [conn1148706] end connection 10.1.240.185:58407 (107 connections now open)
    2017-04-11T21:15:59.933+0000 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:58560 #1148710 (108 connections now open)
    2017-04-11T21:16:00.041+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongochat02:27017
    2017-04-11T21:16:04.540+0000 I NETWORK  [conn1148597] end connection 10.8.129.114:54842 (107 connections now open)

mongo.conf from monogchat01. They are all pretty much the same except the "bindIp" is the server ip.

# mongod.conf

# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

# Where and how to store data.
storage:
  dbPath: /opt/mongo
  journal:
    enabled: true
#  engine:
#  mmapv1:
#  wiredTiger:

# how the process runs
processManagement:
  fork: true  # fork and run in background
  pidFilePath: /var/run/mongodb/mongod.pid  # location of pidfile

# network interfaces
net:
  port: 27017
  bindIp: 127.0.0.1,10.1.240.185  # Listen to local interface only, comment to listen on all interfaces.


#security:

#operationProfiling:

#replication:
replication:
  replSetName: "001-rs"
#sharding:

## Enterprise-Only Options

#auditLog:

#snmp:

Logs on mongochat02

2017-04-12T09:00:18.782-0600 I CONTROL  [main] ***** SERVER RESTARTED *****
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] MongoDB starting : pid=27788 port=27017 dbpath=/opt/mongo 64-bit host=den02vmchat02
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] db version v3.2.9
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] git version: 22ec9e93b40c85fc7cae7d56e7d6a02fd811088c
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] allocator: tcmalloc
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] modules: none
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] build environment:
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten]     distmod: rhel70
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten]     distarch: x86_64
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten]     target_arch: x86_64
2017-04-12T09:00:18.789-0600 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { port: 27017 }, processManagement: { fork: true, pidFilePath: "/var/run/mongodb/mongod.pid" }, replication: { replSetName: "001-rs" }, storage: { dbPath: "/opt/mongo", journal: { enabled: true } }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2017-04-12T09:00:18.809-0600 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=3G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten]
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten]
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten]
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.
2017-04-12T09:00:19.631-0600 I CONTROL  [initandlisten]
2017-04-12T09:00:19.814-0600 I REPL     [initandlisten] Did not find local voted for document at startup;  NoMatchingDocument: Did not find replica set lastVote document in local.replset.election
2017-04-12T09:00:19.814-0600 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument: Did not find replica set configuration document in local.system.replset
2017-04-12T09:00:19.814-0600 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/opt/mongo/diagnostic.data'
2017-04-12T09:00:19.814-0600 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2017-04-12T09:00:20.185-0600 I NETWORK  [initandlisten] waiting for connections on port 27017
2017-04-12T09:00:20.561-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.207:57560 #1 (1 connection now open)
2017-04-12T09:00:20.603-0600 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongochat03:27017
2017-04-12T09:00:21.247-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51255 #2 (2 connections now open)
2017-04-12T09:00:21.291-0600 I NETWORK  [conn2] end connection 10.1.240.185:51255 (1 connection now open)
2017-04-12T09:00:21.424-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51256 #3 (2 connections now open)
2017-04-12T09:00:21.528-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51257 #4 (3 connections now open)
2017-04-12T09:00:21.632-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51258 #5 (4 connections now open)
2017-04-12T09:00:21.736-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51259 #6 (5 connections now open)
2017-04-12T09:00:21.834-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51260 #7 (6 connections now open)
2017-04-12T09:00:21.839-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51261 #8 (7 connections now open)
2017-04-12T09:00:21.877-0600 I NETWORK  [conn7] end connection 10.1.240.185:51260 (6 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK  [conn3] end connection 10.1.240.185:51256 (5 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK  [conn4] end connection 10.1.240.185:51257 (4 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK  [conn5] end connection 10.1.240.185:51258 (3 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK  [conn6] end connection 10.1.240.185:51259 (2 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK  [conn8] end connection 10.1.240.185:51261 (1 connection now open)
2017-04-12T09:00:21.929-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51262 #9 (2 connections now open)
2017-04-12T09:00:21.929-0600 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to chat.zayo.us:27017
2017-04-12T09:00:21.947-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51263 #10 (3 connections now open)
2017-04-12T09:00:22.051-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51264 #11 (4 connections now open)
2017-04-12T09:00:22.154-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51265 #12 (5 connections now open)
2017-04-12T09:00:22.258-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51266 #13 (6 connections now open)
2017-04-12T09:00:22.362-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51267 #14 (7 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK  [conn10] end connection 10.1.240.185:51263 (6 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK  [conn12] end connection 10.1.240.185:51265 (6 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK  [conn13] end connection 10.1.240.185:51266 (4 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK  [conn14] end connection 10.1.240.185:51267 (3 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK  [conn11] end connection 10.1.240.185:51264 (6 connections now open)
2017-04-12T09:00:22.990-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51268 #15 (3 connections now open)
2017-04-12T09:00:23.022-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51270 #16 (4 connections now open)
2017-04-12T09:00:23.051-0600 I NETWORK  [conn15] end connection 10.1.240.185:51268 (3 connections now open)
2017-04-12T09:00:23.092-0600 I NETWORK  [conn16] end connection 10.1.240.185:51270 (2 connections now open)
2017-04-12T09:00:25.397-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51271 #17 (3 connections now open)
2017-04-12T09:00:25.441-0600 I NETWORK  [conn17] end connection 10.1.240.185:51271 (2 connections now open)
2017-04-12T09:00:25.776-0600 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongochat03:27017
2017-04-12T09:00:25.861-0600 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to chat.zayo.us:27017
2017-04-12T09:00:25.994-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51272 #18 (3 connections now open)
2017-04-12T09:00:26.003-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51273 #19 (4 connections now open)
2017-04-12T09:00:26.037-0600 I NETWORK  [conn18] end connection 10.1.240.185:51272 (3 connections now open)
2017-04-12T09:00:26.107-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51274 #20 (4 connections now open)
2017-04-12T09:00:26.211-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51275 #21 (5 connections now open)
2017-04-12T09:00:26.314-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51276 #22 (6 connections now open)
2017-04-12T09:00:26.418-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51277 #23 (7 connections now open)
2017-04-12T09:00:26.461-0600 I NETWORK  [conn19] end connection 10.1.240.185:51273 (6 connections now open)
2017-04-12T09:00:26.461-0600 I NETWORK  [conn20] end connection 10.1.240.185:51274 (5 connections now open)
2017-04-12T09:00:26.461-0600 I NETWORK  [conn21] end connection 10.1.240.185:51275 (4 connections now open)
2017-04-12T09:00:26.462-0600 I NETWORK  [conn22] end connection 10.1.240.185:51276 (3 connections now open)
2017-04-12T09:00:26.462-0600 I NETWORK  [conn23] end connection 10.1.240.185:51277 (2 connections now open)
2017-04-12T09:00:26.514-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51278 #24 (3 connections now open)
2017-04-12T09:00:26.618-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51279 #25 (4 connections now open)
2017-04-12T09:00:26.722-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51280 #26 (5 connections now open)
2017-04-12T09:00:26.826-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51281 #27 (6 connections now open)
2017-04-12T09:00:26.931-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51282 #28 (7 connections now open)
2017-04-12T09:00:26.974-0600 I NETWORK  [conn24] end connection 10.1.240.185:51278 (6 connections now open)
2017-04-12T09:00:26.974-0600 I NETWORK  [conn25] end connection 10.1.240.185:51279 (5 connections now open)
2017-04-12T09:00:26.974-0600 I NETWORK  [conn26] end connection 10.1.240.185:51280 (4 connections now open)
2017-04-12T09:00:26.975-0600 I NETWORK  [conn27] end connection 10.1.240.185:51281 (3 connections now open)
2017-04-12T09:00:26.975-0600 I NETWORK  [conn28] end connection 10.1.240.185:51282 (3 connections now open)
2017-04-12T09:00:27.161-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51283 #29 (3 connections now open)
2017-04-12T09:00:27.200-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51284 #30 (4 connections now open)
2017-04-12T09:00:27.203-0600 I NETWORK  [conn29] end connection 10.1.240.185:51283 (3 connections now open)
2017-04-12T09:00:27.244-0600 I NETWORK  [conn30] end connection 10.1.240.185:51284 (2 connections now open)
2017-04-12T09:00:29.553-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51285 #31 (3 connections now open)
2017-04-12T09:00:29.596-0600 I NETWORK  [conn31] end connection 10.1.240.185:51285 (2 connections now open)
2017-04-12T09:00:30.145-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51286 #32 (3 connections now open)
2017-04-12T09:00:30.188-0600 I NETWORK  [conn32] end connection 10.1.240.185:51286 (2 connections now open)
2017-04-12T09:00:30.569-0600 I NETWORK  [initandlisten] connection accepted from 10.1.240.185:51288 #33 (3 connections now open)
2017-04-12T09:00:30.603-0600 I REPL     [ReplicationExecutor] Error in heartbeat request to mongochat03:27017; ExceededTimeLimit: Operation timed out

Best Answer

Yes, problems is connection between nodes using port 27017. So, check firewall, because something is blocking traffic between nodes.

Fastest and best way to bring up that mongochat02 node is make snapshot of mongochat01 or 03 (dbpath) filesystem and copy that to mongochat02.

Sure, you can make initial sync too, but it is the slowest way. Creating empty mongodb node and set replset to right name.

Anyway, as long as this mongochat02 is on same (old) IP address, you don't need to remove and add it to the replica set.