MySQL Replication and High Availability Guide

high-availabilityMySQLreplication

I know this might not be a very good question to be asked in this site, but I really need some helps/hints to be directed to the right path.

I want to have failover strategy for 2 MySQL servers which are in Master-Master Replication mode. I have done the replication, and it's working just fine. Yet, when in comes to the failover strategy, I'm kind of clueless…

I've been reading on the internet over the past couple of days (I know this is a very complex procedure, and I could not master it in a week or so), but the more I read the more confuse I get. Basically, as I understood, I need to have a software to monitor the network and report any failure and another one to switch between two systems. Heartbeat apparently is the best in doing so, but in all tutorials I read, they are using many other software ( like MON for monitoring, or other software for dumping or getting backups) as well. Is it necessary to have them? Can one have fail-over strategy only using MySQL and HA tool like Heartbeat?

Each website is introducing different tool and ways, which they have their own pros and cons, and of course it depends on how I want my "cluster" to work, and how much high availability is important to my system. I am just considering pointing it out to my customers, and it's not going to the production phase any soon, since my customers running small businesses and don't care about loosing data that much, however, I wanted to dig into the concept by setting up a simple master-master replication where in case of failure of any servers, clients automatically commit changes to the standby server.

Best Answer

Given the fact that you mentioned you have Master-Master replication mode, I would not recommend any automatic failover unless you properly account for Replication Lag. After all, MySQL Replication is asynchronous.

It is theoretically possible to have the following:

DBServer1 as Master to DBServer2
DBServer2 as Master to DBServer1
DBVIP pointing at DBServer1
DBServer2 is 180 seconds behind
DBServer1 goes down
Automatic Failover moves DBVIP to DBServer2

With this scenario, DBServer2 could have auto increment keys that do not exist yet. Upon failover, the DBVIP will allow WebServers to connect to DBServer2 and ask for data that does not exist yet.

This would therefore require background processes running on each DBServer.

For the above scenario:

DBVIP is on DBServer1
DBServer1 runs HeartBeat
DBServer2 runs HeartBeat
Background Process on DBServer1 to monitor
- a) Data Mount Availability
- b) Data Mount Writeability
- c) MySQL Connectivity
- Once a,b, or c fail, kill HeartBeat

Background Process on DBServer2 to make sure DBVIP is pingable

What should killing HeartBeat do? Trigger the startup script defined for it.

What should the startup script on DBServer2 look for?

Loop until DBVIP is unreachable via ping
Connect to MySQL and
- Run SHOW SLAVE STATUS\G in a Loop until Seconds_Behind_Master is NULL
- RUn SHOW SLAVE STATUS\G in a Loop until `Exec_Master_Log_Pos stops changing
Assign DBVIP to DBServer2 via ip addr add

This is essentailly the algorithm for failing over safely to a Passive Master in a Master/Master Replication Cluster.

ALTERNATIVE

If ALL your data is InnoDB, I recommend something with less rigor. Perhaps you should look into using DRBD and HeartBeat. Here is why:

DRBD provides network RAID-1 for a Block Device on two servers.

You would essentially do this:

Have DBServer1's DRBD Block Device as Primary
Have DBServer2's DRBD Block Device as Secondary
Mount DBServer1's DRBD Device on /var/lib/mysql
Startup MySQL on DBServer1
Have HeartBeat Monitor Ping Activity Between Servers

What would startup script look like in a DRBD scenario?

Loop until DBVIP is unreachable via ping
Kill HeartBeart
Disconnect DRBD
Promote DRBD to Primary
Mount DRBD on /var/lib/mysql
Start MySQL (InnoDB Crash Recovery Fills in Missing Data)
Assign DBVIP via ip addr add

This is a lot more straightforward because only one side is Active. The Passive side (DRBD Secondary) is a Synchronous Disk Copy of the Active Side (DRBD Primary).

CAVEAT

If all or most of the working set data is MyISAM, do not touch DRBD. Crash scnearios quickly result in MyISAM tables being marked crashed and need auto-repair (which can be paintfully slow to wait for).

UPDATE 2012-12-29 08:00 EDT

Here are my past posts on using DRBD with MySQL

Mar 29, 2011 : MySQL high availability, failover and replication with Latency
Aug 29, 2011 : MySQL Replication : 1 Slave / Multiple Masters
Dec 19, 2011 : Best way to setup master to multi master replication
Jul 25, 2012 : Mysql database replication on different vlan/subnet/another site (I stirred a pretty big pot on this one in someone else's blog)

Best Answer

ALTERNATIVE

CAVEAT

UPDATE 2012-12-29 08:00 EDT

Related Solutions

MySQL High Availability – Failover and Replication with Latency

Postgresql – Streaming replication and failover on PostgreSQL

Related Question