The basic problem stems from the Slave's IO Thread. Sometimes, it has the nasty habit of playing dumb and not properly checking or pinging the master.
MySQL 5.5's semisynchronous replication allows you to have the master time itself out and degrade back to asynchronous replication. Slaves running semisynchronous replication should have a more sensitive IO Thread now.
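If you go that route, here is a minimal sketch of turning it on under MySQL 5.5+ (the 1000 ms timeout is just an example value, not a recommendation):
-- On the Master: load the bundled semisync plugin and enable it
mysql> INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
mysql> SET GLOBAL rpl_semi_sync_master_enabled = 1;
mysql> SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms to wait for a slave ack before degrading to async
-- On each Slave: load the slave-side plugin and restart the IO Thread so it registers as semisync
mysql> INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
mysql> SET GLOBAL rpl_semi_sync_slave_enabled = 1;
mysql> STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;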
Concerning your particular situation: you are using MySQL 5.0.24? That's a very old version. There are two bug reports (Bug1 and Bug2) discussing this.
It just dawned on me that you asked this question earlier and DTest answered it. Giving credit where credit is due, the bug reports came from his answer.
It would be nice to be informed when the Public IP is being disabled from Slave access. That way, you could run STOP SLAVE; on the slave in advance.
You also asked whether there is a way to learn the master's status from the slave when this type of behavior occurs.
Here is a typical SHOW SLAVE STATUS\G
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.64.80.136
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000019 <<------
Read_Master_Log_Pos: 277892198 <<------
Relay_Log_File: relay-bin.000058
Relay_Log_Pos: 37535484
Relay_Master_Log_File: mysql-bin.000019 <<------
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 277892198 <<------
Relay_Log_Space: 277892637 <<------
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
1 row in set (0.00 sec)
Please note the following:
- Master_Log_File : Log file from the Master whose latest SQL was last copied to the relay logs
- Read_Master_Log_Pos : Log position within Master_Log_File whose latest SQL was last copied to the relay logs
- Relay_Master_Log_File : Log file from the Master whose SQL was last executed from the relay logs
- Exec_Master_Log_Pos : Log position within Relay_Master_Log_File whose SQL was last executed from the relay logs
If none of these values are moving, that could mean that all SQL has been processed, or that the slave is currently processing a SQL statement in the SQL thread that came from position Exec_Master_Log_Pos of the Master log Relay_Master_Log_File.
Now, look at Relay_Log_Space. This number represents the sum total of the filesizes of all relay logs. If Slave_IO_Running is Yes and Relay_Log_Space is not changing, then go check the master by running SHOW MASTER STATUS; it should look something like this:
mysql> show master status;
+------------------+-----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+-----------+--------------+------------------+
| mysql-bin.000019 | 682563021 | | |
+------------------+-----------+--------------+------------------+
1 row in set (0.00 sec)
If the Master's (File, Position) from SHOW MASTER STATUS; is beyond the Slave's (Master_Log_File, Read_Master_Log_Pos) from SHOW SLAVE STATUS\G and the Slave's values are not advancing, then the IO Thread on the Slave is dead for all intents and purposes, even if Slave_IO_Running says Yes.
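Here is a rough shell sketch of that comparison (the slave IP, user, and password are placeholders for your own):
# hypothetical hosts and credentials -- substitute your own
MASTER=10.64.80.136 ; SLAVE=10.64.80.137 ; CRED="-u replicant -pSomePassword"
# what the Master has written
mysql $CRED -h $MASTER -e "SHOW MASTER STATUS\G" | grep -E 'File:|Position:'
# what the Slave has read so far
mysql $CRED -h $SLAVE -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Log_File:|Read_Master_Log_Pos:'
# If the Master's Position keeps growing while Read_Master_Log_Pos stays frozen,
# the IO Thread is effectively dead no matter what Slave_IO_Running says.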
You say you are using pt-table-checksum 2.0.1. I would recommend updating to 2.1, as there are many improvements in the tool.
Next, let me address your test. You say the slave was not updated after either of the first two commands you ran. The second command looks to be trying to connect directly to the slave; pt-table-checksum won't report any differences unless the server you're connecting to has slaves.
Also, the --replicate-check-only option will not do any checksumming (from the docs):
If specified, pt-table-checksum doesn’t checksum any tables. It checks replicas for differences found by previous checksumming, and then exits.
Your first command doesn't seem to be able to connect to the slave host, which is why it doesn't report any differences. Make sure the user/pass that is connecting to the master can also connect to the slave.
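For instance, here is a sketch of the kind of grants typically needed (the user name, password, and percona schema are placeholders; check the pt-table-checksum docs for the exact privileges your version wants):
-- run on the Master and on every Slave you want reachable
mysql> GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE ON *.* TO 'ptuser'@'%' IDENTIFIED BY 'ptpass';
mysql> GRANT ALL ON percona.* TO 'ptuser'@'%';   -- schema where the tool writes its checksums table
mysql> FLUSH PRIVILEGES;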
Now, as for your complex setup, you are right to worry about breaking replication. With some slaves replicating only certain tables, you should heed the warning here:
If the replicas are configured with any filtering options, you should be careful not to checksum any databases or tables that exist on the master and not the replicas.
You can specify which databases you want to checksum with the --databases option, and give a specific list of tables with the --tables option. Alternatively, you can use the --ignore-databases and --ignore-tables options to provide a list of databases/tables not to checksum.
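As an example, a filtered run might look like this (the host, schema, and table names are made up):
# checksum only two schemas, skipping a table that some slaves filter out
pt-table-checksum h=master1,u=ptuser,p=ptpass \
  --databases=app_db,billing_db \
  --ignore-tables=app_db.sessions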
This will probably mean you will want separate pt-table-checksum commands based on which slaves you are trying to checksum. You will probably have to use the 'dsn' --recursion-method to accomplish this (I've never done it personally).
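From the documentation it boils down to listing the slaves in a table and pointing the tool at it; a sketch (hosts and credentials are made up, the percona.dsns table follows the docs) might be:
-- on the master: a table listing only the slaves you want checked
mysql> CREATE TABLE percona.dsns (
          id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
          parent_id INT DEFAULT NULL,
          dsn VARCHAR(255) NOT NULL );
mysql> INSERT INTO percona.dsns (dsn) VALUES ('h=slave1,u=ptuser,p=ptpass');
# then tell the tool to find replicas via that table
pt-table-checksum h=master1,u=ptuser,p=ptpass --recursion-method dsn=D=percona,t=dsns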
As for load, pt-table-checksum comes with some options to throttle itself, namely --max-load and --max-lag.
The tool keeps track of how quickly the server is able to execute the queries, and adjusts the chunks as it learns more about the server’s performance. It uses an exponentially decaying weighted average to keep the chunk size stable, yet remain responsive if the server’s performance changes during checksumming for any reason. This means that the tool will quickly throttle itself if your server becomes heavily loaded during a traffic spike or a background task, for example.
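A sketch of tightening those knobs (the threshold values are only examples):
# pause new chunks while more than 20 threads are running on the master,
# or while any replica is more than 5 seconds behind
pt-table-checksum h=master1,u=ptuser,p=ptpass --max-load Threads_running=20 --max-lag 5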
Best Answer
Given that MySQL Replication is dual-threaded, it is important to recognize how Replication looks when it is broken. There are four main topics in this area:
SQL Thread Dies
The SQL Thread is responsible for reading SQL statements from the relay logs and executing them against the Slave's data.
If any SQL error happens, the SQL Thread simply dies, Slave_SQL_Running switches to No, and the error code and message are posted to Last_Errno and Last_Error in SHOW SLAVE STATUS\G.
This gives you an opportunity to troubleshoot, skip the error, run the SQL statement by hand, and start replication back up. Sometimes it may be a SQL-based error, such as error 1062 (Duplicate Key). Other times, it may be related to the Storage Engine or the OS.
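For example, the usual skip-and-resume sequence (use sparingly; it knowingly discards one replicated event) looks like this:
mysql> STOP SLAVE;
mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;   -- skip exactly one event from the relay log
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G                      -- confirm Slave_SQL_Running is back to Yes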
To figure out if an SQL statement will break replication, you should take any DML (INSERT, UPDATE, or DELETE) and make a corresponding SELECT using the WHERE clause of the DML. Then, run that SELECT to see if the data you are about to manipulate really exists or not.
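A quick illustration with a made-up table and WHERE clause:
-- suppose the relay log holds:  UPDATE accounts SET balance = balance - 50 WHERE account_id = 1234;
-- check on the Slave whether the row it targets actually exists
mysql> SELECT COUNT(*) FROM accounts WHERE account_id = 1234;
-- a count of 0 means the data has already diverged; depending on the statement and
-- replication format, such a mismatch (e.g. a duplicate key on INSERT) can kill the SQL Thread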
I/O Thread Dies
The I/O Thread is responsible for four (4) things: connecting to the Master, requesting binary log events starting from the last position it read, recording those events in the Slave's relay logs, and recording the Master's log file and position it has read so far.
Any network latency may cause the I/O Thread to simply die and retry the connection. Once in a while under those circumstances, the Slave's view of the Master's log file and position (as logged in its relay logs) may be out of sync with what the Master actually recorded in its binary logs.
Other side effects may include corrupt relay log entries. A drastic remedy for that is running RESET MASTER; on the Master to zap all binary logs.
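A common, less drastic sketch is to have the Slave throw away its relay logs and re-read from the last position it executed (the file and position below come from the earlier SHOW SLAVE STATUS\G example; use your own Relay_Master_Log_File / Exec_Master_Log_Pos values):
mysql> STOP SLAVE;
-- CHANGE MASTER TO discards the existing relay logs and starts fresh ones,
-- re-reading from the last statement the SQL Thread actually executed
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000019', MASTER_LOG_POS=277892198;
mysql> START SLAVE;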
Temporary Table Usage
Troubleshooting this is like playing "pin the tail on the donkey". Most developers are unaware of this until it happens, and you try to fix it without realizing where the cause began. Here is the scenario: if you use CREATE TEMPORARY TABLE on a Master, it will replicate to the Slave. For as long as the table is in use, it is kept in existence by the SQL Thread. If you issue STOP SLAVE;, the SQL Thread is voluntarily killed along with all temporary tables the SQL Thread was holding. You do not realize that this has occurred until you issue START SLAVE; and the SQL Thread dies again because the needed temp table no longer exists. To fix this, you have to perform surgery on the master's binary logs and replication as follows (a sketch of the Slave-side commands appears below the list):
- Step 01: Find the binary log file and position on the Master where the CREATE TEMPORARY TABLE was issued
- Step 02: Identify the table the CREATE TEMPORARY TABLE was meant for
- Step 03: On the Slave, create that table by hand using CREATE TABLE instead of CREATE TEMPORARY TABLE
- Step 04: Run CHANGE MASTER TO using the file and position from Step 01
- Step 05: Run START SLAVE; until Replication catches up, or until another table's nonexistence (due to CREATE TEMPORARY TABLE) breaks replication for this same issue
- Step 06: If it is CREATE TEMPORARY TABLE on a different table, go back to Step 01
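As a rough sketch of Steps 03 through 05 on the Slave (the table name, layout, and coordinates are made up):
-- Step 03: recreate the missing temp table as a real table
mysql> CREATE TABLE report_scratch (id INT NOT NULL PRIMARY KEY, total DECIMAL(10,2));
-- Step 04: re-point replication to the file/position found in Step 01
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000019', MASTER_LOG_POS=277000000;
-- Step 05: resume and watch for the next breakage
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G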
Network Inconsiderations
Once upon a time, there was a tendency for MySQL to say Replication was running when, in fact, it was not. This can happen when the network has intermittency that delays transmission of binary log data but is not severe enough to time out the I/O Thread. Since the MySQL process can be inconsiderate by being a little insensitive to the network, I affectionately call this "Network Inconsideration". While the bug report on this is closed, it is good to have multiple ways to check MySQL Replication's ability to run, especially the I/O Thread. Using MySQL 5.5, you could adjust the sensitivity of the I/O Thread using the heartbeat and timeout parameters centered around Semisynchronous Replication.
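A minimal sketch of the heartbeat side of that (the 5-second period is just an example):
-- On the Slave: ask the Master to emit a heartbeat every 5 seconds when it has
-- nothing to send, so the IO Thread can tell a quiet master from a dead connection
mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 5;
mysql> START SLAVE;
mysql> SHOW STATUS LIKE 'Slave_received_heartbeats';   -- should keep climbing while the link is healthy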