MySQL – A minimum of a 3-node Galera Cluster is recommended, but is it better to have 5 nodes?

Tags: galera, linux, mysql

Right now I have a 400 GB database on a 5-node Galera Cluster. All of the nodes run on RAID 10 SSD arrays.

I've read the following:

If the node goes missing due to a network problem or otherwise leaves without telling the rest of the cluster, then problems can arise. For the cluster to function, it needs a quorum, a majority of nodes active in the cluster. The two other nodes will continue to function normally since their partition has more than half of the known nodes but the node that left will stop accepting queries when it realizes that it is no longer in contact with the active partition. In this case, assuming an application can access the two active nodes, the failure can go mostly unnoticed.

I am trying to reduce my costs and make the setup more efficient. The cluster handles a few thousand queries per minute. Is it safe to run a 3-node cluster?

What happens if 1 or 2 of the nodes go down? Would there be a total outage of the database?

Is it recommended to have a 5-node cluster over a 3-node cluster?

Should I put them on RAID 1, RAID 0, or RAID 10? Which would suffice?

Best Answer

If the criterion for "safe" is "no single point of failure", then 3 nodes suffice.

If "safe" means surviving "any two points of failure", then there is no solution. 5 only handles the case where two servers go down, not arbitrary combinations of things.

RAID 10 (or RAID 5 or RAID 1, but not RAID 0) provides recovery from a single disk failure on a single machine. Since a cluster of 3 or more nodes can survive the failure of one entire disk subsystem, RAID is not strictly required; it just gives you an extra level of comfort.

I do like RAID 10 with a battery-backed write cache -- it has the bonus of making writes virtually instantaneous.

Here is a situation that can happen with a 3-node cluster (N1, N2, N3). Say N1 dies. After you put a new (or repaired) N1 into the cluster, it has to rebuild its data, using N2 as the 'donor'. That leaves only N3 at full functionality, since N2 will be somewhat busy sending data to N1. The cluster is still alive, though "slowed".
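You can watch that recovery from the node states themselves; a rough sketch, assuming a standard SST (the exact state strings vary a bit by Galera version):

    -- Run on each node while N1 is rejoining.
    -- Typical values: 'Joiner' on N1, 'Donor/Desynced' on N2, 'Synced' on N3.
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

If you would rather control which node takes the donor hit, the wsrep_sst_donor setting in the [mysqld] section of my.cnf lets you name a preferred donor node; whether that is worth doing for your workload is an assumption on my part.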