MongoDB Production Amazon – 3 data nodes per replica set, or 2 data nodes plus a shared arbiter

amazon-ec2 mongodb replication sharding

I have an Amazon EC2 deployment of MongoDB (3.4). Things are going well, but the DB size is growing quickly. I am about to shard a large collection in order to begin scaling horizontally (total space needed is 2TB).

So the options are (ignoring config servers, etc.):

1. Each shard is a replica set with 3 data nodes

total cost = $1,130 / mo

6x m4.large ($85 ea) + 4x 1TB SSDs ($116 ea) + 2x 1TB magnetic (to save some $$) ($78 ea)

2. Each shard is a replica set with 2 data nodes, each also using a shared arbiter

total cost = $809 / mo

4x m4.large ($85 ea) + 4x 1TB SSDs ($116 ea) + 1x arbiter (cheapest machine is $5)


Diff is $321/mo
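
For anyone checking the figures, the arithmetic above can be reproduced with a short script. This is only a sketch: the prices are the ones quoted in the question, and the constant and function names are made up for illustration.

```python
# Monthly prices quoted in the question (USD). Names are illustrative only.
M4_LARGE = 85
SSD_1TB = 116
MAGNETIC_1TB = 78
ARBITER_HOST = 5


def monthly_cost(items):
    """Sum quantity * unit price over (quantity, unit_price) pairs."""
    return sum(qty * price for qty, price in items)


# Option 1: two shards, three data nodes each (four SSDs, two magnetic disks).
option_1 = monthly_cost([(6, M4_LARGE), (4, SSD_1TB), (2, MAGNETIC_1TB)])

# Option 2: two shards, two data nodes each, plus one shared arbiter host.
option_2 = monthly_cost([(4, M4_LARGE), (4, SSD_1TB), (1, ARBITER_HOST)])

print(option_1, option_2, option_1 - option_2)  # 1130 809 321
```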

I get the feeling that option 2, in Amazon's hosted environment, using exclusively SSDs, should be quite durable. As far as I can see, the only problem with Option 2 is that if a primary node dies and I fail over to the secondary, there is no backup copy for that period. But I can't actually evaluate the severity of this scenario.

Could the risk in Option 2 be mitigated by attaching a spinning disk to each data node to act as a backup volume?

If anyone could provide some advice from experience it would be super helpful, but any advice is appreciated.

Thanks,

Best Answer

Option 2 works in principle; the concern is its more limited redundancy.

  1. About the arbiters:

Yes, you can run multiple arbiters on a single lightweight machine. An arbiter is a very lightweight process that does little but vote, so one small box can host the arbiters for several replica sets. Each arbiter must be a separate mongod process, listening on its own port.
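
As a rough sketch of what that looks like, the snippet below builds one replica set configuration document per shard, each pointing at a different port on the same arbiter machine. The hostnames, ports, and set names are hypothetical, and `replica_set_config` is just a helper invented for this example; the `_id`, `members`, `host`, and `arbiterOnly` fields are the standard replica set configuration fields.

```python
# Hypothetical arbiter machine shared by both shards; replace with your own.
ARBITER_HOST = "arbiter.example.internal"


def replica_set_config(name, data_hosts, arbiter_port):
    """Build a replica set config document: data nodes plus one arbiter."""
    members = [{"_id": i, "host": host} for i, host in enumerate(data_hosts)]
    members.append({
        "_id": len(members),
        # Each replica set gets its own arbiter process, hence its own port.
        "host": f"{ARBITER_HOST}:{arbiter_port}",
        "arbiterOnly": True,
    })
    return {"_id": name, "members": members}


shard_a = replica_set_config(
    "shardA", ["a1.example.internal:27018", "a2.example.internal:27018"], 30001
)
shard_b = replica_set_config(
    "shardB", ["b1.example.internal:27018", "b2.example.internal:27018"], 30002
)

for cfg in (shard_a, shard_b):
    print(cfg["_id"], [m["host"] for m in cfg["members"]])
```

Each document could then be handed to the `replSetInitiate` command on one of that shard's data nodes, assuming two arbiter mongod processes are already running on the shared machine on ports 30001 and 30002.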

  2. About the redundancy:

If you have a 2-data-node replica set and one of the data nodes fails, the other one can still function as primary (with the vote of the arbiter). So you have resilience there, but you no longer have redundancy: until the failed node is repaired or replaced, the surviving node holds the only copy of that shard's data.

Whether that is a severe problem depends on your ops, specifically on how quickly you can fix or replace the failed node. That in turn depends on your monitoring capabilities, the competence and availability of your ops staff, and so on. It's a trade-off you need to evaluate yourself.
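
To make the difference between the two options concrete, here is a small sketch that counts votes and data copies after a single data node is lost. It assumes every member has one vote and that a primary needs a strict majority of the configured voting members; the function names are made up for illustration.

```python
def majority(voting_members):
    """Smallest number of votes that is more than half of the voters."""
    return voting_members // 2 + 1


def after_one_data_node_failure(data_nodes, arbiters):
    """Describe a replica set after losing exactly one data node."""
    voting = data_nodes + arbiters
    remaining_voters = voting - 1
    return {
        # A primary can be elected if the survivors still form a majority.
        "can_elect_primary": remaining_voters >= majority(voting),
        # Arbiters hold no data, so only data nodes count as copies.
        "data_copies_left": data_nodes - 1,
    }


option_1 = after_one_data_node_failure(data_nodes=3, arbiters=0)
option_2 = after_one_data_node_failure(data_nodes=2, arbiters=1)

print(option_1)  # {'can_elect_primary': True, 'data_copies_left': 2}
print(option_2)  # {'can_elect_primary': True, 'data_copies_left': 1}
```

Both layouts stay available after one failure, but Option 2 is left with a single copy of the data until the node is replaced, which is the window the answer above is warning about.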