MongoDB – Replication Maintenance Mode with Multiple Tasks in Progress

Tags: data synchronization, maintenance, mongodb, mongodb-3.2

I have a MongoDB instance where resync is required.

2016-11-07T11:59:23.330+0000 I REPL     [ReplicationExecutor] syncing from: x.x.x.x:27017
2016-11-07T11:59:23.354+0000 W REPL     [rsBackgroundSync] we are too stale to use x.x.x.x:27017 as a sync source
2016-11-07T11:59:23.354+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2016-11-07T11:59:23.354+0000 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2016-11-07T11:59:23.354+0000 I REPL     [rsBackgroundSync] our last optime : (term: 20, timestamp: Oct  4 07:41:29:1)
2016-11-07T11:59:23.354+0000 I REPL     [rsBackgroundSync] oldest available is (term: 20, timestamp: Oct 17 02:13:33:5)
2016-11-07T11:59:23.354+0000 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2016-11-07T11:59:23.355+0000 I REPL     [ReplicationExecutor] going into maintenance mode with 10333 other maintenance mode tasks in progress

What does this line mean?

[ReplicationExecutor] going into maintenance mode with 10333 other maintenance mode tasks in progress

What are maintenance mode tasks? I could not find any MongoDB documentation on them. Why are there 10333 queued? How can I see (list) them? With a search engine I also found log entries with "0 other maintenance mode tasks in progress".

Best Answer

What are maintenance mode tasks?

The "maintenance mode tasks" message refers to a counter of successive calls to the replSetMaintenance command and (as at MongoDB 3.4) isn't associated with specific queued tasks. The replSetMaintenance command is used to keep a secondary in RECOVERING state while some maintenance work is done. A RECOVERING member remains online and potentially syncing, but is excluded from normal read operations (e.g. using secondary read preferences with a driver). Each invocation of replSetMaintenance either increases the task counter (if true) or decreases it (if false). When the counter reaches 0, the member transitions from RECOVERING back to SECONDARY state, assuming it is healthy.

As at MongoDB 3.4, changes in maintenance mode are only noted in the MongoDB log. This command is generally only used internally by mongod, but you can invoke it manually as well.
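
Since the counter itself only appears in the log, the closest thing to "listing" is checking which members are currently RECOVERING. The following is a minimal sketch, assuming a status document shaped like the output of rs.status() (the `members`, `name`, `stateStr` and `optimeDate` fields); the helper name `recoveringMembers` and the sample host names are hypothetical and purely illustrative.

```javascript
// Sketch: pick out members that are in RECOVERING state from a
// replSetGetStatus-style document (what rs.status() returns in the shell).
// recoveringMembers is a hypothetical helper, not part of MongoDB.
function recoveringMembers(status) {
  return status.members
    .filter(function (m) { return m.stateStr === "RECOVERING"; })
    .map(function (m) { return { name: m.name, optimeDate: m.optimeDate }; });
}

// Hypothetical sample document; in a live shell you would pass rs.status().
var sampleStatus = {
  set: "rs0",
  members: [
    { name: "a.example:27017", stateStr: "PRIMARY",
      optimeDate: new Date("2016-11-07T11:59:20Z") },
    { name: "b.example:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2016-11-07T11:59:20Z") },
    { name: "c.example:27017", stateStr: "RECOVERING",
      optimeDate: new Date("2016-10-04T07:41:29Z") }
  ]
};

var stuck = recoveringMembers(sampleStatus);
```

In a connected shell, `recoveringMembers(rs.status())` would return the same shape for the real replica set. This shows the member state only; it does not expose the maintenance task counter.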

Here's an annotated set of log lines and the associated mongo shell commands showing the task counter changing:

// db.adminCommand({replSetMaintenance: 1})
[ReplicationExecutor] going into maintenance mode with 0 other maintenance mode tasks in progress
[ReplicationExecutor] transition to RECOVERING

// db.adminCommand({replSetMaintenance: 1})
[ReplicationExecutor] going into maintenance mode with 1 other maintenance mode tasks in progress

// db.adminCommand({replSetMaintenance: 0})
[ReplicationExecutor] leaving maintenance mode (1 other maintenance mode tasks ongoing)

// db.adminCommand({replSetMaintenance: 0})
[ReplicationExecutor] leaving maintenance mode (0 other maintenance mode tasks ongoing)
[ReplicationExecutor] transition to SECONDARY

// db.adminCommand({replSetMaintenance: 0})
[ReplicationExecutor] Attempted to leave maintenance mode but it is not currently active
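
The counter behaviour in the annotated log above can be sketched as a small model. This is a toy illustration inferred from those log lines, not mongod's actual implementation; `MaintenanceCounter` and `setMaintenance` are hypothetical names.

```javascript
// Toy model of the maintenance-mode counter; NOT mongod's real code.
function MaintenanceCounter() {
  this.count = 0;            // number of outstanding maintenance "tasks"
  this.state = "SECONDARY";  // member state in this simplified model
}

// enable === true mirrors {replSetMaintenance: 1};
// enable === false mirrors {replSetMaintenance: 0}.
// Returns the log-style lines this call would produce in the model.
MaintenanceCounter.prototype.setMaintenance = function (enable) {
  var lines = [];
  if (enable) {
    lines.push("going into maintenance mode with " + this.count +
               " other maintenance mode tasks in progress");
    if (this.count === 0) {
      this.state = "RECOVERING";
      lines.push("transition to RECOVERING");
    }
    this.count += 1;
  } else if (this.count === 0) {
    lines.push("Attempted to leave maintenance mode but it is not currently active");
  } else {
    this.count -= 1;
    lines.push("leaving maintenance mode (" + this.count +
               " other maintenance mode tasks ongoing)");
    if (this.count === 0) {
      this.state = "SECONDARY";
      lines.push("transition to SECONDARY");
    }
  }
  return lines;
};

// Replay the same sequence as the annotated log: on, on, off, off, off.
var member = new MaintenanceCounter();
var replay = [true, true, false, false, false].map(function (flag) {
  return member.setMaintenance(flag);
});
```

Each element of `replay` holds the lines for one call, in the same order as the annotated log. The model also shows why a member only leaves RECOVERING once every "enable" has been matched by a "disable".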

Why are there 10333 queued?

In MongoDB 3.2 a replica set member that becomes "too stale" (i.e. doesn't have any oplog entries in common with another healthy member of the replica set) will remain in RECOVERING mode and periodically check if a new valid sync source is available. Each check currently increments the "maintenance task" counter, so this doesn't actually indicate a meaningful number of tasks if the member has become stale.

In theory "too stale" is not a terminal state, as a member with a larger oplog may conceivably be temporarily offline; in practice a "too stale to catch up" error generally means a manual resync is required.

2016-11-07T11:59:23.354+0000 I REPL     [rsBackgroundSync] our last optime : (term: 20, timestamp: Oct  4 07:41:29:1)
2016-11-07T11:59:23.354+0000 I REPL     [rsBackgroundSync] oldest available is (term: 20, timestamp: Oct 17 02:13:33:5)

In this case the replica set member in question went stale almost two weeks earlier, so the maintenance mode counter has continued to creep up over time. There's a related issue in the MongoDB Jira you can watch/upvote: SERVER-23899: Reset maintenance mode when transitioning from too-stale to valid sync source.
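
As a rough check on how far behind this member is, the gap between the two optimes in the log can be computed directly. This is a sketch under two assumptions: the year is 2016 (taken from the log line timestamps, since the optime lines omit it) and the optimes are in UTC like the rest of the log. The variable and function names are illustrative.

```javascript
// Assumption: year 2016 and UTC, inferred from the surrounding log lines.
// Note that JavaScript months are 0-based (9 = October, 10 = November).
var lastOptime      = Date.UTC(2016, 9, 4, 7, 41, 29);    // "our last optime"
var oldestAvailable = Date.UTC(2016, 9, 17, 2, 13, 33);   // "oldest available is"
var loggedAt        = Date.UTC(2016, 10, 7, 11, 59, 23);  // time of the log entry

var MS_PER_DAY = 24 * 60 * 60 * 1000;

function gapInDays(fromMs, toMs) {
  return (toMs - fromMs) / MS_PER_DAY;
}

// Gap between this member's last applied entry and the oldest entry
// still available on the sync source: roughly 12.8 days.
var oplogGapDays = gapInDays(lastOptime, oldestAvailable);

// How far behind the member was when the message was logged: roughly 34.2 days.
var behindAtLogTimeDays = gapInDays(lastOptime, loggedAt);
```

Because the member's last optime is older than the oldest entry the sync source still holds, there is no common point in the oplog to resume from, which is what "too stale" means here.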