What happened
I asked in the comments of the question that the OP provide the output of `rs.status()`.
The reason was that the primary reverted to secondary status as soon as a single member was shut down. This made it obvious that the replica set had lost the quorum necessary to elect and maintain a primary, which could only be the case if at least one additional voting member of the originally configured replica set was already, or had become, unavailable.
As it turned out, the arbiter of the replica set in question was not reachable by the primary, which after the shutdown of the PC was the only remaining member (from its point of view) of the replica set. It was therefore impossible to hold an election with a quorum of the configured replica set members, and the node consequently reverted to secondary state.
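For reference, a quick way to see which members the current node can actually reach is to inspect the member states reported by `rs.status()`. This is just an illustrative mongo shell snippet using the standard output fields (`members`, `name`, `stateStr`, `health`):

```js
// Run in the mongo shell on the member you are investigating.
// Prints each configured member, its state, and whether this node considers it healthy.
rs.status().members.forEach(function (m) {
    print(m.name + "  state: " + m.stateStr + "  healthy: " + m.health);
});
// A member this node cannot reach typically shows up with health 0 and a
// "(not reachable/healthy)" state -- in the OP's case this would have exposed
// the unreachable arbiter immediately.
```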
How to prevent
- Always run `rs.status()` after setting up a replica set.
- Always run `rs.status()` when encountering problems with a replica set.
- Always do failover tests (up to and including losing write capability) and ensure your application handles those situations gracefully (as the OP did). A minimal sketch of such a test follows this list.
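As a minimal sketch of such a failover test (the collection name and document are placeholders, and the second step stands in for whatever write path your application uses): step down the primary, then verify that a write attempt fails gracefully rather than crashing anything.

```js
// 1. On the current primary, force a failover (the old primary refuses
//    to be re-elected for 60 seconds by default).
rs.stepDown();

// 2. From the "application" side (here simply the mongo shell), attempt a
//    majority write and handle the failure gracefully instead of crashing.
try {
    db.failoverTest.insertOne(
        { probe: new Date() },
        { writeConcern: { w: "majority", wtimeout: 5000 } }
    );
    print("write acknowledged - replica set healthy");
} catch (e) {
    // Expected while no primary is available or a majority cannot acknowledge.
    print("write failed, application should retry or queue: " + e.message);
}
```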
Using these rules, you will eliminate the vast majority of problems one can face when using a replica set.
Personally, I think MongoDB Inc.'s Cloud Manager is a must for production environments, since it surfaces problems such as the one the OP had right away and has alerting built in.
Side note
Never, ever (and yes, that means no exceptions, however sound the reasons may seem to be) put an arbiter on a data-bearing node of the same replica set.
Imagine the node hosting both the arbiter and a data-bearing member goes down.
If you have a 3-member replica set, you would no longer have a quorum of the original members; the remaining member would automatically revert to secondary, losing failover capability.
In a 5-member replica set, two voting members would be eliminated at once. Fine as long as all the others are up and running, right? Except it isn't fine: if another node fails, you lose your quorum again. So with only two nodes failed, the other two nodes become more or less useless. Given the price of a virtual server today (and even the smallest ones are well sufficient to run an arbiter), this simply does not make sense. You would be paying for 4 data-bearing nodes anyway and lose failover capability because you tried to save a tiny fraction of the overall costs.
With a 7-member replica set, that saving becomes an even tinier fraction of the overall costs.
Conclusion: it is simply a bad decision business-wise to have an arbiter running on the same machine as a data-bearing node, even when setting aside the technical aspects.
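If you do need an arbiter, put it on its own small host. A minimal sketch, assuming a hypothetical hostname `arbiter1.example.net` and the default port:

```js
// Run on the primary: add an arbiter that lives on its own small host,
// never on a machine that already carries a data-bearing member of this set.
rs.addArb("arbiter1.example.net:27017");

// Verify afterwards that the arbiter shows up as ARBITER and is reachable.
rs.status().members
  .filter(function (m) { return m.stateStr === "ARBITER"; })
  .forEach(function (m) { print(m.name + "  healthy: " + m.health); });
```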
Best Answer
There is no strict requirement to have the same O/S for all members of a replica set, but in general it is a good idea to have consistent O/S so you have similar configuration and performance tuning across replica set members.
However, since your DC3 (cloud) instance appears to be an arbiter (which only participates in voting) any O/S differences should be irrelevant to performance.
Amazon Linux evolved from RHEL, so isn't an entirely different O/S (for example, like Linux vs Windows). However, there may be different configuration or tuning between Linux distributions. I wouldn't expect dramatically different performance between Linux distros, but this is something you'd have to test with your own use case and workload.
This is up to your own security policy, but I would expect patch upgrades to be applied similarly for On-Premise versus cloud servers.
Assuming you have an equal number of instances in DC1 vs DC2, an arbiter can be useful to ensure a primary can be elected in the event either DC is unreachable. An arbiter cannot acknowledge writes (since it is a voting-only node), so if you have a Primary-Secondary-Arbiter (PSA) configuration you will not be able to acknowledge majority writes if one of your data bearing nodes is unavailable.
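To make that concrete: in a PSA set with one data-bearing member down, a majority write can only time out, because the arbiter does not count towards write acknowledgement. A hedged illustration (collection and document are placeholders):

```js
// In a PSA replica set degraded to PsA (one secondary down), the primary plus
// the arbiter still form a voting majority, but only one data-bearing node can
// acknowledge the write -- so w: "majority" cannot be satisfied and times out.
db.orders.insertOne(
    { sku: "test-item", qty: 1 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
);
// Expected result: a write concern timeout error after roughly 5 seconds,
// even though the write itself has been applied on the primary.
```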
I would strongly recommend using PSS (i.e. no arbiter) to support consistent failover with both elections and majority write concern.
As noted above, arbiters cannot acknowledge writes, so they are not recommended if you want to maintain fault tolerance with majority write concern. With a PSA configuration degraded to PsA (one secondary down), you still have write availability (a primary can be maintained) but no longer have replication or data redundancy (only one data-bearing node is writing data).
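For completeness, a minimal sketch of initiating a PSS set (hostnames and the set name are placeholders for your DC1/DC2/DC3 machines); because all three members are data-bearing, majority writes keep working with any single node down:

```js
// Initiate a three-member PSS replica set: three data-bearing, voting members.
rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "dc1-node1.example.net:27017" },
        { _id: 1, host: "dc2-node1.example.net:27017" },
        { _id: 2, host: "dc3-node1.example.net:27017" }
    ]
});
// With any single member down, the remaining two still form a majority,
// so elections succeed and w: "majority" writes continue to be acknowledged.
```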