MongoDB failover when master is stuck on IOWait


We host a MongoDB 3.4 replica set in AWS with three nodes: a primary, a secondary, and an arbiter. Normally, if the primary instance dies, failover to the secondary is pretty quick (10-30 seconds).

Today we had a network issue where the MongoDB primary instance lost connectivity with the disk containing the database for about 3 minutes, and CPU IOWait went to 100%. During this time, queries to the primary just hung and timed out. Probably because the primary was still up (though unresponsive), the replica set did not fail over or even start an election.

Is there a configuration that would trigger a failover in such cases as well? Or are there ready-made tools that could force a failover if simple queries to the primary node start taking too long?

Best Answer

A closely related question is discussed extensively in the comments on SERVER-14139, a bug report filed against MongoDB. To summarize: it is not feasible to build a fully general hang-detection system inside a server process.

The comments there discuss a monitoring approach that kills the mongod process (or shuts down the operating system), which can be implemented with a cron job or the watchdog daemon. Because a mongod process cannot win an election before it has read and written some data through its storage engine, it is safe to attempt to restart mongod immediately after killing it: while the storage is still hung, the restarted process will not accept connections and certainly cannot win an election for primary.
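The monitoring approach above might be sketched as a small shell script run from cron. This is a hedged illustration, not a vetted production watchdog: the probe command, timeout value, and script path are assumptions, and the `db.adminCommand({ping: 1})` probe is just one cheap query you could use.

```shell
#!/bin/sh
# Sketch of a cron-driven health probe that kills a hung mongod.
# Assumptions (not from the original answer): a hang is declared when
# a cheap probe exceeds PROBE_TIMEOUT seconds; the init system or
# systemd is expected to restart mongod after it is killed.

PROBE_TIMEOUT=${PROBE_TIMEOUT:-20}   # seconds before declaring a hang

# Run the probe command under a hard timeout; succeed only if it
# answered in time.
probe_ok() {
    timeout "$PROBE_TIMEOUT" "$@" >/dev/null 2>&1
}

# Kill mongod if the probe hangs or fails. Per the answer, an
# immediate restart is safe: with storage still hung, the restarted
# process cannot win an election for primary.
check_and_kill() {
    if probe_ok "$@"; then
        return 0
    fi
    echo "probe hung or failed after ${PROBE_TIMEOUT}s; killing mongod" >&2
    pkill -9 -x "${MONGOD_PROC:-mongod}"
    return 1
}

# Hypothetical cron entry and probe (requires the mongo shell):
# * * * * * /usr/local/bin/mongod-watchdog.sh
# check_and_kill mongo --host 127.0.0.1 --quiet --eval 'db.adminCommand({ping: 1})'
```

Killing with SIGKILL rather than a graceful shutdown is deliberate here: a mongod blocked on I/O may never process a clean shutdown request.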

A ticket linked to SERVER-14139 covers a storage watchdog timer implemented in the Enterprise (non-free) version of MongoDB. Organizations that can use the watchdog daemon or an external monitoring process instead should prefer that approach, because it can protect against more kinds of resource failure.
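For reference, the Enterprise storage watchdog is controlled by the `watchdogPeriodSeconds` server parameter. The fragment below is a sketch of enabling it in `mongod.conf`; check your MongoDB version's documentation for availability and the minimum allowed period (60 seconds) before relying on it.

```
# mongod.conf sketch (MongoDB Enterprise): terminate mongod if
# filesystem checks on the data directories stall longer than the
# configured period. 60 is the documented minimum value.
setParameter:
  watchdogPeriodSeconds: 60
```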
