Given that MySQL Replication is dual-threaded, it is important to recognize what Replication looks like when it is broken. There are four main topics in this area:
SQL Thread Dies
The SQL Thread is responsible for three (3) things:
- Getting the Next SQL Statement from the Relay Logs
- Executing the SQL Statement
- Rotating Relay Logs by Deleting any Relay Log that had all its SQL Entries Executed
If any SQL error happens, the SQL Thread simply dies and the following is posted to its Slave Status:
- Error Number
- Error Message
- SQL statement that experienced the Error
- Current database
- Master Log File where the SQL Originated
- Master Log Position where the SQL Originated
This gives you an opportunity to troubleshoot: you can skip the error, run the SQL statement by hand, and then start replication back up. Sometimes it may be a SQL-based error, such as error 1062 (Duplicate Key). Other times, it may be related to the Storage Engine or the OS.
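If the statement is safe to skip (a judgment call you make only after troubleshooting), a minimal sketch of the skip-and-restart sequence looks like this; sql_slave_skip_counter skips exactly one event from the relay log:

STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skip the one offending event
START SLAVE;
SHOW SLAVE STATUS\G                     -- confirm Slave_SQL_Running: Yes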
To figure out if an SQL statement will break replication, you should take any DML (INSERT, UPDATE, or DELETE) and make a corresponding SELECT using the WHERE clause of the DML. Then, run that SELECT to see if the data you are about to manipulate really exists or not.
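For example (the table name and WHERE clause here are hypothetical), given a replicated UPDATE that is about to run, the corresponding SELECT is built from the same WHERE clause:

UPDATE mydb.accounts SET balance = balance - 100 WHERE account_id = 12345;  -- the DML in question
SELECT * FROM mydb.accounts WHERE account_id = 12345;                       -- the probe
-- If the SELECT returns nothing, the row the DML expects is missing on the Slave,
-- which is exactly the kind of data drift that breaks replication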
I/O Thread Dies
The I/O Thread is responsible for four (4) things:
- Downloading SQL from the Binary Log Entries of a Master
- Recording SQL into its Local Relay Logs as a FIFO queue
- Acknowledging Communication Failure
- Attempting to Reestablish the I/O Thread Connection
Any network latency may cause the I/O Thread to simply die and retry the connection. Once in a while under those circumstances, the Slave's viewpoint of the Master's log file and position (as logged in its relay logs) may be out of sync with what the Master actually recorded in its binary logs.
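To see how far apart the two viewpoints are, compare the Slave's status fields against the Master (these fields are standard output of the commands shown):

SHOW SLAVE STATUS\G
-- Master_Log_File / Read_Master_Log_Pos       : the I/O Thread's view of the Master
-- Relay_Master_Log_File / Exec_Master_Log_Pos : how far the SQL Thread has executed
SHOW MASTER STATUS;   -- run this on the Master: the file and position it actually recorded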
Other side effects may include corrupt relay log entries:
- caused by bad network transmission, which can be corrected by running CHANGE MASTER TO starting from the last SQL statement from the Master that the Slave executed (see the sketch after this list)
- caused by corrupt binary log entries on the Master which were successfully transmitted to the relay logs, which can be corrected by
  - running RESET MASTER; on the Master to zap all binary logs
  - setting up replication from the new current binary log
  - using pt-table-sync to correct any data differences
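For the first case, a minimal sketch of that CHANGE MASTER TO on the Slave (the file name and position are placeholders; take them from Relay_Master_Log_File and Exec_Master_Log_Pos in SHOW SLAVE STATUS\G):

STOP SLAVE;
CHANGE MASTER TO
    MASTER_LOG_FILE='mysql-bin.000123',  -- placeholder: Relay_Master_Log_File
    MASTER_LOG_POS=456789;               -- placeholder: Exec_Master_Log_Pos
START SLAVE;
-- CHANGE MASTER TO discards the existing relay logs, so the Slave re-downloads
-- from the Master starting at the last statement it actually executed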
Temporary Table Usage
Troubleshooting this is like playing "pin the tail on the donkey". Most developers are unaware of this until it happens, and you end up trying to fix it without realizing where the problem began. Here is the scenario: if you use CREATE TEMPORARY TABLE on a Master, it will replicate to the Slave. For as long as the table is in use, it is kept in existence within the SQL Thread. If you issue STOP SLAVE;, the SQL Thread is voluntarily killed along with all temporary tables the SQL Thread was holding. You do not realize that this has occurred until you issue START SLAVE; and the SQL Thread dies again because the needed temp table no longer exists.
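As an illustration (the database, table, and column names are made up), the sequence looks like this:

CREATE TEMPORARY TABLE mydb.tmp_new_orders       -- runs on the Master, replicates to the Slave
    SELECT id FROM mydb.orders WHERE status = 'NEW';
STOP SLAVE;    -- on the Slave: kills the SQL Thread and drops its temporary tables
START SLAVE;   -- any later replicated statement touching mydb.tmp_new_orders now
               -- fails with an error such as 1146 (Table doesn't exist)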
To fix this, you have to perform surgery on the Master's binary logs and replication as follows:
- Step 01) Locate the exact log file and position where the CREATE TEMPORARY TABLE was issued on the Master
- Step 02) Locate the name of the database that the CREATE TEMPORARY TABLE was meant for, and create the table in that database using CREATE TABLE instead of CREATE TEMPORARY TABLE
- Step 03) Run CHANGE MASTER TO using the file and position from Step 01 (see the consolidated sketch after this list)
- Step 04) Run START SLAVE; until Replication catches up or another table's nonexistence (due to CREATE TEMPORARY TABLE) breaks replication for this same issue
- Step 05) If replication breaks again because of CREATE TEMPORARY TABLE on a different table, go back to Step 01
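A consolidated sketch of Steps 02 through 04, run on the Slave (the database name, table definition, log file, and position are placeholders for whatever you find in Steps 01 and 02):

USE mydb;                               -- placeholder: database from Step 02
CREATE TABLE tmp_new_orders (id INT);   -- placeholder: same definition, but CREATE TABLE, not CREATE TEMPORARY TABLE
STOP SLAVE;
CHANGE MASTER TO
    MASTER_LOG_FILE='mysql-bin.000123', -- placeholder: file from Step 01
    MASTER_LOG_POS=456789;              -- placeholder: position from Step 01
START SLAVE;                            -- Step 04: let replication catch up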
Network Inconsiderations
Once upon a time, there was a tendency for MySQL to say Replication was running when, in fact, it was not. This can happen when the network has intermittency that delays the transmission of binary log data but is not severe enough to time out the I/O Thread. Since the MySQL process can be inconsiderate by being a little insensitive to the network, I affectionately call this "Network Inconsideration". While the bug report on this is closed, it is good to have multiple ways to check MySQL Replication as to its ability to run, especially the I/O Thread. Using MySQL 5.5, you can adjust the sensitivity of the I/O Thread using the heartbeat and timeout parameters centered around Semisynchronous Replication.
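A minimal sketch of those knobs on MySQL 5.5 (the values are purely illustrative, not recommendations):

STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 5;  -- Master sends a heartbeat every 5 seconds when idle
SET GLOBAL slave_net_timeout = 30;             -- Slave declares the connection dead after 30 silent seconds
START SLAVE;
-- If the Semisynchronous Replication plugins are installed, the Master side also has
-- SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- milliseconds before falling back to async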
Best Answer
I would like to suggest something radical. I got this idea from Star Trek: Deep Space Nine ("Call to Arms").
With that DS9 analogy, I bring you an interesting idea.
You set up 2 initial read Slaves, each a MySQL Slave with InnoDB disabled
This is optional. My preference is an all-MyISAM Slave to do reads because it is faster for reads than InnoDB for small datasets. Should you choose to go with InnoDB, make sure you relax its ACID compliance.
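One typical way to relax ACID compliance on a disposable read Slave is a line like this in /etc/my.cnf:

[mysqld]
# Flush the InnoDB log to disk roughly once per second instead of at every COMMIT;
# losing up to a second of transactions is acceptable on a Slave you can rebuild at will
innodb_flush_log_at_trx_commit = 2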
In the event of a crash, just destroy the Slave and spin up a new one
We'll call the Slaves S0 and S1
Here is something else: have this in /etc/my.cnf in S0
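Judging from STEP 05 further down (which removes this setting on S2), the snippet includes at least the following:

[mysqld]
# Keep InnoDB's dirty page percentage at zero so pages are flushed continuously,
# leaving almost nothing left to flush when S0 is shut down for the rsync
innodb_max_dirty_pages_pct = 0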
These settings help S0 shut down fast with completely flushed data.
The following is what you must script in the replicator process
When you need to generate a new slave (we will call it S2), here is what you must do
STEP 01) On S0, run service mysql stop
STEP 02) Install the same version of MySQL that S0 has onto S2
STEP 03) On S0, run scp /etc/my.cnf S2:/etc/.
STEP 04) On S2, change the server-id in /etc/my.cnf to a unique value (suggestion: use the 2nd, 3rd, and 4th octets [without the dots] of the private IP of S2)
STEP 05) On S2, either remove or comment out innodb_max_dirty_pages_pct = 0 from /etc/my.cnf
STEP 06) On S0, run rsync -av /var/lib/mysql/ S2:/var/lib/mysql/ (Note: if you have to spin up 5 Slaves, run the 5 rsyncs at this point)
STEP 07) On S2, run service mysql start (MySQL Replication starts immediately where S0 had left off)
STEP 08) On S0, run service mysql start
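Once S2 comes up in STEP 07, a quick sanity check on S2 confirms both threads are running (standard SHOW SLAVE STATUS fields):

SHOW SLAVE STATUS\G
-- Expect Slave_IO_Running: Yes and Slave_SQL_Running: Yes,
-- with Seconds_Behind_Master shrinking toward 0 as S2 catches up to the Master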
Once you create the replicator script, you can use it to spin up MySQL on new Slaves.
Meanwhile, S1 is available for SELECT queries and continues replicating.
S0 is used to recreate a new Slave.
If you are doing this in conjunction with spinning up Amazon EC2 or some other Cloud DB Servers, check with your System Administrators on any Linux commands/APIs that allow you to spin up a DB Server. Then, you apply the replicator to the newly generated DB Server. Even better, you can incorporate the DB Server creation API into your Replicator Script.
CAVEAT
If you have to spin up dozens or hundreds of Slaves, all you need to do is have 10 Replicator Slave Servers (S0 - S9) and have 10 copies of the replicator script operate on different Replicator Slaves.