MongoDB secondary node not recovering

mongodb mongodb-3.4

A secondary node of my MongoDB cluster has entered the RECOVERING state and is not coming out of it. Below is what I see in the log. I know one way to fix this is to reinitialize the secondary by deleting its data directory and restarting it, but I would rather not do that: I have 2 TB of data and the primary is receiving writes continuously.

2017-06-13T12:02:14.946+0000 I REPL [replication-12569] We are too stale to use mongodb.prod.mcse-reporting-olap.services.dal1.prod.walmart.com:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: 59351d47:3357 is before their earliest timestamp: 593f8b97:5b11 for 1min until: 2017-06-13T12:03:14.946+0000
2017-06-13T12:02:14.946+0000 I REPL [replication-12569] could not find member to sync from
2017-06-13T12:02:14.948+0000 E REPL [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2017-06-13T12:02:14.948+0000 I REPL [rsBackgroundSync] Our newest OpTime : { ts: Timestamp 1496653127000|13143, t: 499 }
2017-06-13T12:02:14.948+0000 I REPL [rsBackgroundSync] Earliest OpTime available is { ts: Timestamp 1497336727000|23313, t: 502 }
2017-06-13T12:02:14.948+0000 I REPL [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2017-06-13T12:02:14.948+0000 I REPL [rsBackgroundSync] going into maintenance mode with 11386 other maintenance mode tasks in progress

Best Answer

The link in the error message explains exactly what happened:

A replica set member becomes “stale” when its replication process falls so far behind that the primary overwrites oplog entries the member has not yet replicated. The member cannot catch up and becomes “stale.” When this occurs, you must completely resynchronize the member by removing its data and performing an initial sync.
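To see how far behind the member in the question actually is, the two hex values from its log ("our last fetched timestamp" and "their earliest timestamp") can be decoded. The following is only a rough sketch: it assumes each value has the form seconds:counter, with both parts in hexadecimal and the first part being Unix seconds, which is consistent with the decimal OpTime values printed a few lines later in the same log. The helper name parse_hex_optime is made up for this example.

```python
from datetime import datetime, timezone


def parse_hex_optime(text):
    """Split 'seconds:counter' (both hex) into (UTC datetime, counter)."""
    seconds_hex, counter_hex = text.split(":")
    when = datetime.fromtimestamp(int(seconds_hex, 16), tz=timezone.utc)
    return when, int(counter_hex, 16)


# Values copied from the log in the question.
last_fetched, last_counter = parse_hex_optime("59351d47:3357")
earliest_on_source, earliest_counter = parse_hex_optime("593f8b97:5b11")

# History that the sync source no longer holds in its oplog.
gap = earliest_on_source - last_fetched

print("secondary last fetched:    ", last_fetched.isoformat())
print("oldest entry on the source:", earliest_on_source.isoformat())
print("missing history:           ", gap)
```

For these values the gap is a little under eight days, so the secondary needs operations that the primary overwrote long ago.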

To avoid this in the future:

  • Investigate why the secondary fell so far behind. One possible cause is a heavier write load than normally expected.

  • Your oplog may be too small. Once the secondary's newest operation is older than the oldest entry in the sync source's oplog, it can never catch up, because it has no way to get the operations it is missing.