MongoDB failover when master is stuck on IOWait


We host a MongoDB 3.4 replica set in AWS with three nodes: a primary, a secondary, and an arbiter. Normally, if the primary instance dies, failover to the secondary is pretty quick (10-30 seconds).

Today we had a network issue where the MongoDB primary instance lost connectivity with the disk containing the database for about 3 minutes, and CPU IOWait went to 100%. During this time, queries to the primary just hung and timed out. Probably because the primary was still up (though unresponsive), the replica set did not fail over or even start an election.

Is there a configuration that would trigger a failover in such cases as well? Or are there ready-made tools that could force a failover if simple queries to the primary node start taking too long?

Best Answer

A closely related question is discussed extensively in the comments on SERVER-14139, a bug report filed against MongoDB. To summarize: it is not feasible to build a fully general hang-detection system inside a server process.

The comments there discuss a monitoring approach that kills the mongod process (or shuts down the operating system), which can be implemented with a cron job or the watchdog daemon. Because a mongod process cannot win an election before it has read and written some data through its storage engine, it is safe to attempt to restart mongod immediately after killing it: while the storage is still hung, the restarted process will not accept connections and certainly cannot win an election for primary.
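The monitoring approach above might be sketched as a small shell script run from cron. This is a hedged illustration, not a vetted production watchdog: the probe command, timeout value, and script path are assumptions, and the `db.adminCommand({ping: 1})` probe is just one cheap query you could use.

```shell
#!/bin/sh
# Sketch of a cron-driven health probe that kills a hung mongod.
# Assumptions (not from the original answer): a hang is declared when
# a cheap probe exceeds PROBE_TIMEOUT seconds; the init system or
# systemd is expected to restart mongod after it is killed.

PROBE_TIMEOUT=${PROBE_TIMEOUT:-20}   # seconds before declaring a hang

# Run the probe command under a hard timeout; succeed only if it
# answered in time.
probe_ok() {
    timeout "$PROBE_TIMEOUT" "$@" >/dev/null 2>&1
}

# Kill mongod if the probe hangs or fails. Per the answer, an
# immediate restart is safe: with storage still hung, the restarted
# process cannot win an election for primary.
check_and_kill() {
    if probe_ok "$@"; then
        return 0
    fi
    echo "probe hung or failed after ${PROBE_TIMEOUT}s; killing mongod" >&2
    pkill -9 -x "${MONGOD_PROC:-mongod}"
    return 1
}

# Hypothetical cron entry and probe (requires the mongo shell):
# * * * * * /usr/local/bin/mongod-watchdog.sh
# check_and_kill mongo --host 127.0.0.1 --quiet --eval 'db.adminCommand({ping: 1})'
```

Killing with SIGKILL rather than a graceful shutdown is deliberate here: a mongod blocked on I/O may never process a clean shutdown request.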

A ticket linked to SERVER-14139 covers a storage watchdog timer implemented in the Enterprise (non-free) version of MongoDB. Organizations that can use the watchdog daemon or an external monitoring process instead should prefer that approach, because it can protect against more kinds of resource failure.
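For reference, the Enterprise storage watchdog is controlled by the `watchdogPeriodSeconds` server parameter. The fragment below is a sketch of enabling it in `mongod.conf`; check your MongoDB version's documentation for availability and the minimum allowed period (60 seconds) before relying on it.

```
# mongod.conf sketch (MongoDB Enterprise): terminate mongod if
# filesystem checks on the data directories stall longer than the
# configured period. 60 is the documented minimum value.
setParameter:
  watchdogPeriodSeconds: 60
```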
