The basic problem stems from the Slave's IO Thread. Sometimes, it has the nasty habit of playing dumb and not properly checking or pinging the master.
MySQL 5.5's semisynchronous replication allows you to have the master time itself out and degrade back to asynchronous replication. Slaves running semisynchronous replication should have a more sensitive IO Thread now.
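If you go that route, here is a minimal sketch of turning it on under MySQL 5.5+ (the 1000 ms timeout is just an example value, not a recommendation):
-- On the Master: load the bundled semisync plugin and enable it
mysql> INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
mysql> SET GLOBAL rpl_semi_sync_master_enabled = 1;
mysql> SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms to wait for a slave ack before degrading to async
-- On each Slave: load the slave-side plugin and restart the IO Thread so it registers as semisync
mysql> INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
mysql> SET GLOBAL rpl_semi_sync_slave_enabled = 1;
mysql> STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;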
Concerning your particular situation: you are using MySQL 5.0.24? That's a very old version. There are two bug reports (Bug1 and Bug2) discussing this.
It just dawned on me that you asked this question earlier and DTest answered it. Giving credit where credit is due, the bug reports came from his answer.
It would be nice to be informed when the Public IP is being disabled from Slave access. That way, you could run STOP SLAVE; on the slave in advance.
You also asked whether there is a way to learn the master's status from the slave when this type of behavior occurs.
Here is a typical SHOW SLAVE STATUS\G
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.64.80.136
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000019 <<------
Read_Master_Log_Pos: 277892198 <<------
Relay_Log_File: relay-bin.000058
Relay_Log_Pos: 37535484
Relay_Master_Log_File: mysql-bin.000019 <<------
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 277892198 <<------
Relay_Log_Space: 277892637 <<------
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
1 row in set (0.00 sec)
Please note the following:
- Master_Log_File : Log file from the Master whose latest SQL was last copied to the relay logs
- Read_Master_Log_Pos : Log position within Master_Log_File whose latest SQL was last copied to the relay logs
- Relay_Master_Log_File : Log file from the Master whose SQL was last executed from the relay logs
- Exec_Master_Log_Pos : Log position within Relay_Master_Log_File whose SQL was last executed from the relay logs
If none of these values are moving, that could mean that all SQL has been processed, or that the slave is currently processing a SQL statement in the SQL thread that came from position Exec_Master_Log_Pos of the Master log Relay_Master_Log_File.
Now, look at Relay_Log_Space. This number represents the sum total of the filesizes of all relay logs. If Slave_IO_Running is Yes and Relay_Log_Space is not changing, then go check the master by running SHOW MASTER STATUS; it should look something like this:
mysql> show master status;
+------------------+-----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+-----------+--------------+------------------+
| mysql-bin.000019 | 682563021 | | |
+------------------+-----------+--------------+------------------+
1 row in set (0.00 sec)
If the Master's (File, Position) from SHOW MASTER STATUS; is beyond the Slave's (Master_Log_File, Read_Master_Log_Pos) from SHOW SLAVE STATUS\G and the Slave's values are not advancing, then the IO Thread on the Slave is dead for all intents and purposes, even if Slave_IO_Running says Yes.
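Here is a rough shell sketch of that comparison (the slave IP, user, and password are placeholders for your own):
# hypothetical hosts and credentials -- substitute your own
MASTER=10.64.80.136 ; SLAVE=10.64.80.137 ; CRED="-u replicant -pSomePassword"
# what the Master has written
mysql $CRED -h $MASTER -e "SHOW MASTER STATUS\G" | grep -E 'File:|Position:'
# what the Slave has read so far
mysql $CRED -h $SLAVE -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Log_File:|Read_Master_Log_Pos:'
# If the Master's Position keeps growing while Read_Master_Log_Pos stays frozen,
# the IO Thread is effectively dead no matter what Slave_IO_Running says.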
You say you are using pt-table-checksum 2.0.1. I would recommend updating to 2.1, as there are many improvements in the tool.
Next, let me address your test. You say the slave was not updated after either of the first two commands you ran. The second command looks to be trying to connect directly to the slave; pt-table-checksum won't report any differences unless the server you're connecting to has slaves.
Also, the --replicate-check-only option will not do any checksumming (from the docs):
If specified, pt-table-checksum doesn’t checksum any tables. It checks replicas for differences found by previous checksumming, and then exits.
Your first command doesn't seem to be able to connect to the slave host, which is why it doesn't report any differences. Make sure the user/pass that is connecting to the master can also connect to the slave.
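For instance, here is a sketch of the kind of grants typically needed (the user name, password, and percona schema are placeholders; check the pt-table-checksum docs for the exact privileges your version wants):
-- run on the Master and on every Slave you want reachable
mysql> GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE ON *.* TO 'ptuser'@'%' IDENTIFIED BY 'ptpass';
mysql> GRANT ALL ON percona.* TO 'ptuser'@'%';   -- schema where the tool writes its checksums table
mysql> FLUSH PRIVILEGES;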
Now, as for your complex setup, you are right to worry about breaking replication. With some slaves replicating only certain tables, you should heed the warning here:
If the replicas are configured with any filtering options, you should be careful not to checksum any databases or tables that exist on the master and not the replicas.
You can specify which databases you want to checksum with the --databases option, and give a specific list of tables with the --tables option. Alternatively, you can use the --ignore-databases and --ignore-tables options to provide a list of databases/tables not to checksum.
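As an example, a filtered run might look like this (the host, schema, and table names are made up):
# checksum only two schemas, skipping a table that some slaves filter out
pt-table-checksum h=master1,u=ptuser,p=ptpass \
  --databases=app_db,billing_db \
  --ignore-tables=app_db.sessions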
This will probably mean you will want separate pt-table-checksum commands based on which slaves you are trying to checksum. You will probably have to use the 'dsn' --recursion-method to accomplish this (I've never done it personally).
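From the documentation it boils down to listing the slaves in a table and pointing the tool at it; a sketch (hosts and credentials are made up, the percona.dsns table follows the docs) might be:
-- on the master: a table listing only the slaves you want checked
mysql> CREATE TABLE percona.dsns (
          id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
          parent_id INT DEFAULT NULL,
          dsn VARCHAR(255) NOT NULL );
mysql> INSERT INTO percona.dsns (dsn) VALUES ('h=slave1,u=ptuser,p=ptpass');
# then tell the tool to find replicas via that table
pt-table-checksum h=master1,u=ptuser,p=ptpass --recursion-method dsn=D=percona,t=dsns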
As for load, pt-table-checksum comes with some options to throttle itself, namely --max-load and --max-lag.
The tool keeps track of how quickly the server is able to execute the queries, and adjusts the chunks as it learns more about the server’s performance. It uses an exponentially decaying weighted average to keep the chunk size stable, yet remain responsive if the server’s performance changes during checksumming for any reason. This means that the tool will quickly throttle itself if your server becomes heavily loaded during a traffic spike or a background task, for example.
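A sketch of tightening those knobs (the threshold values are only examples):
# pause new chunks while more than 20 threads are running on the master,
# or while any replica is more than 5 seconds behind
pt-table-checksum h=master1,u=ptuser,p=ptpass --max-load Threads_running=20 --max-lag 5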
Best Answer
Given that MySQL Replication is dual-threaded, it is important to recognize how Replication looks when it is broken. There are four main topics in this area:
SQL Thread Dies
The SQL Thread is responsible for reading SQL statements from the relay logs and executing them against the Slave's data.
If any SQL error happens, the SQL Thread simply dies, Slave_SQL_Running switches to No, and the error code and message are posted to Last_Errno and Last_Error in SHOW SLAVE STATUS\G.
This gives you an opportunity to troubleshoot, skip the error, run the SQL statement by hand, and start replication back up. Sometimes it may be a SQL-based error, such as error 1062 (Duplicate Key). Other times, it may be related to the Storage Engine or the OS.
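For example, the usual skip-and-resume sequence (use sparingly; it knowingly discards one replicated event) looks like this:
mysql> STOP SLAVE;
mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;   -- skip exactly one event from the relay log
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G                      -- confirm Slave_SQL_Running is back to Yes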
To figure out if an SQL statement will break replication, you should take any DML (INSERT, UPDATE, or DELETE) and make a corresponding SELECT using the WHERE clause of the DML. Then, run that SELECT to see if the data you are about to manipulate really exists or not.
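A quick illustration with a made-up table and WHERE clause:
-- suppose the relay log holds:  UPDATE accounts SET balance = balance - 50 WHERE account_id = 1234;
-- check on the Slave whether the row it targets actually exists
mysql> SELECT COUNT(*) FROM accounts WHERE account_id = 1234;
-- a count of 0 means the data has already diverged; depending on the statement and
-- replication format, such a mismatch (e.g. a duplicate key on INSERT) can kill the SQL Thread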
I/O Thread Dies
The I/O Thread is responsible for four (4) things: connecting to the Master, requesting binary log events starting from the last position it read, recording those events in the Slave's relay logs, and recording the Master's log file and position it has read so far.
Any network latency may cause the I/O Thread to simply die and retry the connection. Once in a while under those circumstances, the Slave's view of the Master's log file and position (as logged in its relay logs) may be out of sync with what the Master actually recorded in its binary logs.
Other side effects may include corrupt relay log entries. A drastic remedy for that is running RESET MASTER; on the Master to zap all binary logs.
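A common, less drastic sketch is to have the Slave throw away its relay logs and re-read from the last position it executed (the file and position below come from the earlier SHOW SLAVE STATUS\G example; use your own Relay_Master_Log_File / Exec_Master_Log_Pos values):
mysql> STOP SLAVE;
-- CHANGE MASTER TO discards the existing relay logs and starts fresh ones,
-- re-reading from the last statement the SQL Thread actually executed
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000019', MASTER_LOG_POS=277892198;
mysql> START SLAVE;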
Temporary Table Usage
Troubleshooting this is like playing "pin the tail on the donkey". Most developers are unaware of this until it happens, and you try to fix it without realizing where the cause began. Here is the scenario: if you use CREATE TEMPORARY TABLE on a Master, it will replicate to the Slave. For as long as the table is in use, it is kept in existence by the SQL Thread. If you issue STOP SLAVE;, the SQL Thread is voluntarily killed along with all temporary tables the SQL Thread was holding. You do not realize that this has occurred until you issue START SLAVE; and the SQL Thread dies again because the needed temp table no longer exists. To fix this, you have to perform surgery on the master's binary logs and replication as follows (a sketch of the Slave-side commands appears below the list):
- Step 01: Find the binary log file and position on the Master where the CREATE TEMPORARY TABLE was issued
- Step 02: Identify the table the CREATE TEMPORARY TABLE was meant for
- Step 03: On the Slave, create that table by hand using CREATE TABLE instead of CREATE TEMPORARY TABLE
- Step 04: Run CHANGE MASTER TO using the file and position from Step 01
- Step 05: Run START SLAVE; until Replication catches up, or until another table's nonexistence (due to CREATE TEMPORARY TABLE) breaks replication for this same issue
- Step 06: If it is CREATE TEMPORARY TABLE on a different table, go back to Step 01
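As a rough sketch of Steps 03 through 05 on the Slave (the table name, layout, and coordinates are made up):
-- Step 03: recreate the missing temp table as a real table
mysql> CREATE TABLE report_scratch (id INT NOT NULL PRIMARY KEY, total DECIMAL(10,2));
-- Step 04: re-point replication to the file/position found in Step 01
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000019', MASTER_LOG_POS=277000000;
-- Step 05: resume and watch for the next breakage
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G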
Network Inconsiderations
Once upon a time, there was a tendency for MySQL to say Replication was running when, in fact, it was not. This can happen when the network has intermittency that delays transmission of binary log data but is not severe enough to time out the I/O Thread. Since the MySQL process can be inconsiderate by being a little insensitive to the network, I affectionately call this "Network Inconsideration". While the bug report on this is closed, it is good to have multiple ways to check MySQL Replication's ability to run, especially the I/O Thread. Using MySQL 5.5, you could adjust the sensitivity of the I/O Thread using the heartbeat and timeout parameters centered around Semisynchronous Replication.
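A minimal sketch of the heartbeat side of that (the 5-second period is just an example):
-- On the Slave: ask the Master to emit a heartbeat every 5 seconds when it has
-- nothing to send, so the IO Thread can tell a quiet master from a dead connection
mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 5;
mysql> START SLAVE;
mysql> SHOW STATUS LIKE 'Slave_received_heartbeats';   -- should keep climbing while the link is healthy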