The MongoDB site here states that an election occurs when the primary becomes unavailable. What happens if there is a disk read/write error on the primary node? Does the heartbeat check cover disk health, system load, or other checks, or just connectivity?
MongoDB – Do Disk Errors on Primary Instance Trigger Election Process?
mongodb
Related Solutions
You have answered some of your own questions here; specifically, you have a decent idea about the write lock aspect of the equation: 12,000 inserts/sec gets you to ~60% write lock. That's a reasonable level for consistent performance - you will get some contention, and some ops will be a little slower, but you really want to start worrying at about 80%. Like a lot of things, once you exceed 80% of available capacity you will start hitting issues much more often.
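If you want to sanity-check that write lock percentage yourself, here is a minimal sketch with pymongo, assuming the MMAPv1-era globalLock.lockTime/totalTime counters (these were removed in later server versions); the connection string is a placeholder:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
gl = client.admin.command("serverStatus")["globalLock"]

# lockTime and totalTime are cumulative microsecond counters since startup
# (MMAPv1-era fields), so their ratio approximates the write lock percentage.
pct = 100.0 * gl["lockTime"] / gl["totalTime"]
print(f"approx. write lock: {pct:.1f}%")
```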
In terms of other bottlenecks, specifically how quickly you can write to disk: this can cause problems, but to watch the relevant stats over time I would recommend installing MMS with the munin-node plugin, which gives you hardware and IO stats in addition to the MongoDB stats.
When you have that, the metrics you will want to keep an eye on are:
- The Average Flush time (this is how long MongoDB's periodic sync to disk is taking)
- The IOStats in the hardware tab (IOWait in particular)
- Page Faults (if your disk is busy with writes and you need to read data, they are going to be competing for a scarce resource)
It gets a bit complicated from there, but here's the basic idea (a short sketch for pulling these numbers programmatically follows the list):
- When average flush time starts to increase, be worried
- If it gets into the multiple second range, you are probably on the limit (though this depends on the volume of data written and the disk speed)
- If it approaches 60 seconds you will see performance degrade severely (the flush happens every 60 seconds, so they would essentially be queuing up)
- High IOWait is going to hinder performance too, especially if you have to read from disk at any point
- Hence looking at page fault levels will also be important
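As a rough illustration, the same serverStatus command exposes the flush and page fault numbers mentioned above. This is a sketch assuming the MMAPv1-era backgroundFlushing and extra_info fields, which may be absent on newer storage engines (hence the defensive lookups):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
status = client.admin.command("serverStatus")

# backgroundFlushing reports the periodic (60s) sync to disk; watch for
# average_ms trending upward toward multiple seconds.
flushing = status.get("backgroundFlushing", {})
print("avg flush ms: ", flushing.get("average_ms"))
print("last flush ms:", flushing.get("last_ms"))

# Page faults indicate reads competing with writes for scarce disk IO.
print("page faults:  ", status.get("extra_info", {}).get("page_faults"))
```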
The other piece of this puzzle, which we have not mentioned yet, is the journal. That will be persisting data to disk as well (by default every 100ms) and so it will be adding to the disk's load if it is on the same volume. Hence if you are seeing high disk utilization, then moving the journal off to another disk would be a good idea.
There are no real "magic numbers" to stay under; in most cases it's all relative. Get a good baseline for your normal traffic, check whether things are trending up, and maybe load test to see what your limits are and when things start to degrade - then you will be in good shape.
After all that preamble, on to some of your questions:
What happens if there are more inserts per second than mongod is able to save to the hard disk? Will there be any warning or will it simply fail silently?
If you start to stress the disk to the levels described above, eventually everything will slow down and at some point (depending on timeouts, how beefy your hardware is, and how you handle exceptions) your writes will fail. If you are using a recent version of pymongo, you will be using safe writes by default, and those failures will surface as exceptions. If you wish to be a little more paranoid, you can occasionally issue a write with the j:true write concern, which will not return OK until the write has made it to the journal (i.e. is on disk). This will, of course, be slower than a normal safe write, but it is an immediate indicator of disk capacity related issues, and you could use it to block/queue other operations and essentially act as a throttle to prevent your database from being overwhelmed.
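For illustration, here is a minimal sketch of the j:true approach using the modern pymongo API (database and collection names are hypothetical):

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

# j=True: the write is only acknowledged once it has hit the journal on disk.
journaled = client.mydb.get_collection(
    "events", write_concern=WriteConcern(j=True)
)

try:
    journaled.insert_one({"sensor": "s1", "value": 3.14})
except PyMongoError:
    # A failed or timed-out journaled write is an early signal that the disk
    # is saturated - back off, queue, or throttle incoming writes here.
    raise
```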
I am thinking about a simple replication setup using one master and one slave. Does the initial sync or a resync process lock the databases?
I think I covered locking overall at the start, but to answer this piece specifically: first, make sure you are using a replica set, not master/slave. The master/slave implementation is deprecated and not recommended for general use. As for the initial sync, it will add some load to the primary in terms of reads, but not writes, so you should be fine in terms of locking.
What happens to my data if the write queue keeps growing over the long term?
As you can probably tell from the explanation above, the answer is very much dependent on how you write your application, how you choose to have your writes acknowledged, and how much capacity you have available. You can, essentially, be as safe as you wish when it comes to writing to disk on MongoDB, but there is a performance trade-off, as mentioned in the j:true discussion above.
Generally, you want to figure out your limiting factor - be it locking, disk speed, etc. - then track those levels over time and scale out (sharding) or up (better hardware) before you hit a hard limit and see performance problems.
One last thing: db.serverStatus().writeBacksQueued is actually a metric that will only ever be non-zero in a sharded environment. It has to do with making sure that writes to a chunk during a migration are dealt with appropriately (handled by the writeback listener). Hence it is essentially a red herring here - nothing to do with general write volume.
It is not a great practice to mix the app and database on the same server. You're competing for resources. You also mentioned remote execution versus local execution on the same disk. I'd contend that you could suffer from disk contention if it was all on the same disk and the overhead of the remote replication is questionable by comparison. A lot of factors play into that including the type of disk(s) and if you are separating the OS, logs, and data.
Regarding your topology: yes, a best practice in any DB technology is to place replica set members in different datacenters. Depending on your disaster recovery requirements, having at least one of them close to the primary node would be beneficial. There are other considerations, such as an emergency on one coast of the US that makes all 3 members unavailable; that is a case-by-case scenario to weigh, though.
You should also consider your read and write concerns when addressing performance and availability. The default write concern is 1. If availability is your greater concern, you may want to use a WC of 2 or majority in the example you gave. Only testing your workload will prove this for your use case, and the WC can be set per operation, so for less important operations you could use a less strict WC.
You should explore the bulk insert method as well (a short batching sketch follows). Personally, for time-series data I have found the default and maximum batch size of 1000 to work very well. I've also had good results increasing it to 5000 and letting the DB engine break the batches out into 1k batches; anything beyond that has shown poorer insert rates.
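As a sketch of the batching idea with pymongo (database and collection names are hypothetical; 1000 matches the batch size described above):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client.metrics.samples  # hypothetical database/collection

def insert_in_batches(docs, batch_size=1000):
    # Split client-side into fixed-size batches; ordered=False lets the
    # server continue past individual document failures within a batch.
    for i in range(0, len(docs), batch_size):
        coll.insert_many(docs[i:i + batch_size], ordered=False)
```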
It is quite a lengthy topic but these are some ideas.
Best Answer
It depends on exactly what you mean by a disk read/write error. If the error is that the data (dbpath) volume becomes read-only or is unmounted (e.g. an EBS volume is detached from a running EC2 instance), that alone will not trigger an election: as long as the mongod process on the primary remains running and is exchanging heartbeats, no failover will occur. The primary will not be able to accept writes, but it will not step down in this situation.
The heartbeat checks connectivity between the members of the replica set (you can look at the code for it here); it does not inspect disk health, system load, or anything beyond reachability. It is sent every 2 seconds and times out after 10 seconds.
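To see what the heartbeat actually reports, you can inspect replSetGetStatus. A minimal sketch with pymongo (connection string and replica set name are placeholders); note it reflects reachability and replication state only, not disk health:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", replicaSet="rs0")  # placeholders
status = client.admin.command("replSetGetStatus")

for member in status["members"]:
    # lastHeartbeat is only present for remote members, hence .get().
    print(member["name"], member["stateStr"], member["health"],
          member.get("lastHeartbeat"))
```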