MongoDB – Add a new node to a Mongo replica set using rsync

Tags: data-synchronization, mongodb, replication

I have a 3-shard MongoDB cluster. Each shard is a 3-node replica set, and each shard holds 3 TB of data.

I want to add one more node to each replica set, but it is taking a long time: the initial sync is very slow, copying only 130 GB over a whole day.

My servers have 16 Gbps network bandwidth and very high-IOPS SSD disks (everything runs on GCP).

So I tried to set up the new node with rsync instead:

  1. Install and configure MongoDB.
  2. Stop the mongod service.
  3. Add the replica set and shard details to the config file.
  4. Run rsync from an existing secondary node (sketched below).
  5. Start MongoDB.
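
Step 4 was essentially a straight copy of the dbPath. A minimal sketch, assuming the new node's mongod is stopped (mongo-secondary-1, the paths, and the mongod:mongod ownership are placeholders for my actual values):

# run on the new node while its mongod is stopped
rsync -a --delete mongo-secondary-1:/mongodb/data/ /mongodb/data/

# make sure the files end up owned by the user mongod runs as
chown -R mongod:mongod /mongodb/data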

But it's not working. Am I missing something, or is there an alternate approach for this?

Update: Adding more info

MongoDB version: 3.6

Config file on existing and new node:

storage:
  dbPath: /mongodb/data
  journal:
    enabled: true

systemLog:
  destination: file
  logAppend: true
  path: /mongodb/logs/mongod.log

net:
  port: 27017
  bindIp: 0.0.0.0
  compression:
    compressors: snappy

security:
  keyFile: /mongodb/mongodb-keyfile

replication:
  replSetName: prod-rpl-1

sharding:
  clusterRole: shardsvr

MongoDB log:

2019-08-09T14:42:27.604+0000 I CONTROL  [initandlisten] MongoDB starting : pid=23478 port=27017 dbpath=/mongodb/data 64-bit host=Server-new-1
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten] db version v3.6.13
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten] git version: db3c76679b7a3d9b443a0e1b3e45ed02b88c539f
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten] modules: none
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten] build environment:
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten]     distmod: rhel70
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten]     distarch: x86_64
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2019-08-09T14:42:27.605+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "0.0.0.0", compression: { compressors: "snappy" }, port: 27017 }, replication: { replSetName: "production-1" }, security: { keyFile: "/mongodb/mongodb-keyfile" }, sharding: { clusterRole: "shardsvr" }, storage: { dbPath: "/mongodb/data", journal: { enabled: true } }, systemLog: { destination: "file", logAppend: true, path: "/mongodb/logs/mongod.log" } }
2019-08-09T14:42:27.607+0000 I -        [initandlisten] Detected data files in /mongodb/data created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2019-08-09T14:42:27.607+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=51778M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),compatibility=(release="3.0",require_max="3.0"),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2019-08-09T14:42:28.643+0000 I STORAGE  [initandlisten] WiredTiger message [1565361748:642980][23478:0x7f58ac14fb80], txn-recover: Main recovery loop: starting at 3/768
-----------
... index rebuilding (log lines omitted) ...
-----------
2019-08-09T14:47:46.828+0000 I INDEX    [repl writer worker 15] build index on: config.cache.chunks.se4_manage_media_production_aispartner.twitter_auth_apps properties: { v: 2, key: { lastmod: 1 }, name: "lastmod_1", ns: "config.cache.chunks.se4_manage_media_production_aispartner.twitter_auth_apps" }
2019-08-09T14:47:46.828+0000 I INDEX    [repl writer worker 15]          building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-09T14:47:46.830+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2019-08-09T14:47:46.830+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2019-08-09T14:47:46.830+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2019-08-09T14:47:46.831+0000 I REPL     [signalProcessingThread] shutdown: removing all drop-pending collections...
2019-08-09T14:47:46.831+0000 I REPL     [signalProcessingThread] shutdown: removing checkpointTimestamp collection...
2019-08-09T14:47:46.831+0000 I REPL     [signalProcessingThread] shutting down replication subsystems
2019-08-09T14:47:46.831+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host Server-old-1:27017 due to bad connection status; 2 connections to that host remain open
2019-08-09T14:47:46.832+0000 I INDEX    [repl writer worker 15] build index on: config.cache.chunks.se4_manage_media_production_aispartner.twitter_auth_apps properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "config.cache.chunks.se4_manage_media_production_aispartner.twitter_auth_apps" }
2019-08-09T14:47:46.832+0000 I INDEX    [repl writer worker 15]          building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-09T14:47:46.833+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host Server-old-1:27017 due to bad connection status; 1 connections to that host remain open
2019-08-09T14:47:46.897+0000 E REPL     [replication-1] Initial sync attempt failed -- attempts left: 9 cause: CallbackCanceled: error fetching oplog during initial sync: initial syncer is shutting down
2019-08-09T14:47:46.897+0000 I STORAGE  [replication-1] Finishing collection drop for local.temp_oplog_buffer (no UUID).
2019-08-09T14:47:46.903+0000 I REPL     [replication-1] Initial Sync has been cancelled: CallbackCanceled: failed to schedule work _startInitialSyncAttemptCallback-1 at 2019-08-09T14:47:47.897+0000: initial syncer is shutting down
2019-08-09T14:47:46.986+0000 I REPL     [signalProcessingThread] Stopping replication storage threads
2019-08-09T14:47:46.990+0000 W SHARDING [signalProcessingThread] error encountered while cleaning up distributed ping entry for Server-new-1:27017:1565361980:-4300361147585369138 :: caused by :: ShutdownInProgress: Shutdown in progress
2019-08-09T14:47:46.990+0000 W SHARDING [signalProcessingThread] cant reload ShardRegistry  :: caused by :: CallbackCanceled: Callback canceled
2019-08-09T14:47:46.990+0000 I FTDC     [signalProcessingThread] Shutting down full-time diagnostic data capture
2019-08-09T14:47:47.001+0000 I STORAGE  [WTOplogJournalThread] oplog journal thread loop shutting down
2019-08-09T14:47:47.004+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
2019-08-09T14:47:48.758+0000 I STORAGE  [signalProcessingThread] WiredTiger message [1565362068:758636][23917:0x7faef147d700], txn-recover: Main recovery loop: starting at 7/65416960
2019-08-09T14:47:48.875+0000 I STORAGE  [signalProcessingThread] WiredTiger message [1565362068:875207][23917:0x7faef147d700], txn-recover: Recovering log 7 through 8
2019-08-09T14:47:48.919+0000 I STORAGE  [signalProcessingThread] WiredTiger message [1565362068:919195][23917:0x7faef147d700], txn-recover: Recovering log 8 through 8
2019-08-09T14:47:48.992+0000 I STORAGE  [signalProcessingThread] WiredTiger message [1565362068:992751][23917:0x7faef147d700], txn-recover: Set global recovery timestamp: 0
2019-08-09T14:47:49.065+0000 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
2019-08-09T14:47:49.066+0000 I CONTROL  [signalProcessingThread] now exiting
2019-08-09T14:47:49.066+0000 I CONTROL  [signalProcessingThread] shutting down with code:0

Best Answer

  1. What is the usual rate at which data syncs in your cluster?
  2. Have you stopped writes on the secondary that the new member is syncing from?
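
You can check both from the mongo shell on the sync source; for example (authentication options omitted, add them as your keyFile setup requires):

# per-secondary replication lag, as seen from the member you connect to
mongo --port 27017 --eval 'rs.printSlaveReplicationInfo()'

# oplog size and the time window it currently covers
mongo --port 27017 --eval 'rs.printReplicationInfo()'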

Alternate approach: if your business allows it, bring down one of the secondary nodes and perform the sync from it, as sketched below.
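
A rough outline of that sequence (hostnames, paths, and the service name are placeholders; authentication options are omitted). The two points that matter are that the source's data files must not change while you copy them, and that the copy plus restart must finish within the oplog window, so the new member can catch up from the oplog instead of falling back to a full initial sync:

# 1. Cleanly shut down the source secondary so its data files are consistent.
mongo --host mongo-secondary-1 --eval 'db.getSiblingDB("admin").shutdownServer()'

# 2. On the new node, with its mongod stopped, copy the entire dbPath
#    (journal included) and fix ownership.
rsync -a --delete mongo-secondary-1:/mongodb/data/ /mongodb/data/
chown -R mongod:mongod /mongodb/data

# 3. Restart the source secondary, then start the new node (on each host).
sudo systemctl start mongod

# 4. From the primary, add the new member to the replica set.
mongo --host mongo-primary-1 --eval 'rs.add("Server-new-1:27017")'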

Thank you!

-Anban