Mysql – How to avoid MySQL slave to have errors after crash

MySQLreplication

I have a MySQL master/slave configuration (MariaDB 10.0.17). Sometimes, the slave or the master server crashes. After the crash and restart, I get some errors of duplicated keys on slave. The server get stuck into these errors and I need to take action by deleting the duplicated key entries so it goes on.

How to avoid this or, if not possible, do this "slave recover" automatically?

Best Answer

  • Regarding de-synchronization of a single or master server: Mosty, the cause of these problems come from the fact that the master is not in full ACID mode or it is using MyISAM tables. In order to avoid problems with a server, only use InnoDB as the engine of your tables and set innodb_flush_log_at_trx commit to 1. This is not efficient -you will lose a lot of performance in a busy server except if you can get advantage of the group commit, but it is the only way to guarantee data consistency with disk (no transaction loss).

    Many people, that do not have appropriate hardware (hardware cache), are forced to relax the value of this variable to 0 or 2, but in that case, you (or your hardware) is responsible to assure the availability of the disk/service in case something goes wrong (crash).

  • The second reason where disk can be out of sync with a master or single server is not using sync_binlog=1. By default, this is set to 0, and that means that you can write data internally, but that doesn't assure that it is written to the binary log. This shouldn't be an issue if you assume the master is always going to be back online, but you must understand that may be an issue. Activating it almost doubles your IOPS requirements, so be careful. It is not usually necessary and I do not recommend you to blindly activate it.

  • Third, the most common cause of desynchronization is the storage of the replication position in the non-transactional master.info file in the slave. This is the default configuration. I highly recommend changing the configuration of the variables master_info_repository and relay_log_info_repository to use TABLE (InnoDB transactional tables) instead of FILE. This will get rid of the desynchronization problems when the slave crashes.

  • Additionally, in order to make sure that the master and slave are always in sync, and that no transactions are skipped, I recommend you to use the GTID replication, available in MySQL 5.6/MariaDB 10. While it does not prevent the desync of the master and slave, it makes easier to identify it and resync master and slave. This is a tutorial on how to enable it on MariaDB 10. Please note that GTID implementations between MySQL and MariaDB are not compatible. Once GTID is running, things like failover are greatly simplified.

  • There could be other reasons for desynchronization, like undeterministic updates, or replication filters. Avoid the later, and you can mitigate the former by using row-based replication binlog_format = ROW.

  • On a different note, it is important to remark that MySQL default replication protocol is asynchronous, so one can never guarantee the same state on the master and the slave. If slave drift and other associated issues are a no-go, you can try alternatives like the semi-sync plugin for replication or Galera cluster, which both make tighter links between nodes.

    Alternatively, I recommend you to both constantly monitor and repair replication problems with tools like pt-heartbeat, pt-table-checksum and pt-table-sync, which will avoid the need to manually provision the slave for every incident you encounter.

In order to have a more accurate answer, I recommend you to hire a consultant so that you can apply the best strategy in your case (some of the above may have an important impact on your performance while not being your specific problem).