Mysql – Programatically testing slave status – What fields in SHOW SLAVE STATUS indicated slave status is OK / in ERROR

MySQLreplicationtesting

We use pingdom and a few internal heartbeat mechanisms on our product.

We would like to add a 'are our slaves running and in good shape' check.

What fields in SHOW SLAVE STATUS indicate that the slave where SHOW SLAVE STATUS was executed on is not (or is) replicating correctly?

The 2 we are currently testing are:

Slave_IO_Running (must equal Yes)

Slave_SQL_Running (must equal Yes)

Seconds_Behind_Master (must be integer, and less then 10 seconds behind master)

Any others we should be testing?

Thanks again!

Best Answer

SCENARIO #1

You may want to monitor Relay_Log_Space. It represents the some total of relay logs the Slave has to process. When should you worry about it?

Give this scenario

Slave_IO_Running is Yes
Slave_SQL_Running is No

This indicates that that the IO Thread is still downloading entries from its Master's binary logs. The longer you take to address the SQL error, the more the Relay_Log_Space will pile up. If the Master is in a heavy-write environment, relay logs will pile up quickly and puts the Slave at risk for disk space issues.

If you do not want to monitor Relay_Log_Space, you can always set the relay_log_space_limit. This force a cap on the amount of disk space allow for relay logs. For example, if you set this:

[mysqld]
relay_log_space_limit=8G

and restart mysql, this causes the IO Thread to stop collecting new entries from the Master when the sum of all relay log file sizes >= 8GB.. After the SQL Thread is done processing the oldest relay log, that relay log gets rotated out and the IO Thread starts getting entries from the Master where it left off. This provides a decent throttling mechanism for relay logs.

Another field you may want to look at is Last_SQL_Errno. Once Slave_IO_Running is Yes and Slave_SQL_Running is No, the Last_SQL_Errno will be set to something nonzero. On very rare occasions, it is possible to have this:

Slave_IO_Running is Yes
Slave_SQL_Running is No
Last_SQL_Errno is 0

When you have this unique phenomenon, mysqld has encountered an undocumented error (Don't laugh, I have seen this).

SCENARIO #2

Something else to keep in mind (not necessarily for monitoring)

Give this scenario

Slave_IO_Running is No
Slave_SQL_Running is No

Under what three circumstances would this occur?

Someone ran STOP SLAVE; by hand or in a backup script
The option skip-slave-start is in /etc/my.cnf and mysql was started
Issuing service mysql start --skip-slave-start or service mysql restart --skip-slave-start

Make sure not to monitor false positives should both replication threads are not running.

SCENARIO #3

Give this scenario

Slave_IO_Running is No
Slave_SQL_Running is Yes

This quickly indicates that there is a communication error because of one of the following

The Master went down
mysqld on the Master went down
Network Intermittency

Related Solutions

MySQL slave replicates changes that are in neither binlog_do_db nor replicate_do_db

Replication filtering isn't bulletproof. Due to how the filtering is implemented the events responsible for your errors are being generated because the default database at query runtime is the my-database schema as expected and the query being executed is fully qualified INSERT INTO phpmyadmin.pma_column_info...

Peter Zaitsev explains the scenario well in this post:

Filtered MySQL Replication

Thesql 5.6 gtid replication slave stuck (system lock)

Since I see more than 2 system user entries in the processlist, I would assume you are using Multi-Threaded Replication (slave_parallel_workers > 1).

That looks like a bug

On Oct 29, 2014, this was expressed by David Moss

Thank you for your feedback. This issue was covered in bug 17326020 and the following was added to the MySQL 5.6.21 and 5.7.5 changelogs:

When the I/O thread reconnected to a master using GTIDs and multithreaded slaves while in the middle of a transaction, it failed to abort the transaction, leaving a partial transaction in the relay log, and then retrieving the same transaction again. This occurred when performing a rotation of the relay log. Now when reconnecting, the server checks before rotating the log in such cases, and waits first for any ongoing transaction to complete.

Therefore nothing new will be added to cover this bug and I'm closing it as fixed.

On Dec 10, 2014, this was expressed by Laurynas Biveinis

Problem:

With MTS, GTIDs and auto positioning enabled, when a worker applies a partial transaction left on relaylog by an IO thread reconnection, it will wait for the XID log event to commit the transaction.

Unfortunately, the SQL thread coordinator will reach the master's ROTATE event on the next relaylog file and will wait for all workers to finish their tasks before applying the ROTATE.

Analysis:

As the whole transaction is retrieved again by the IO thread after the reconnection, the slave must rollback the partial transaction once noticing this ROTATE from the master.

This bug reports the same issue already fixed by BUG#17326020, and the reported issue is not reproducible anymore. So, this patch is just adding a new test case.

SUGGESTION

Run FLUSH BINARY LOGS; on the Master

See if the movement triggers a response from the SQL threads.

If it does not, go ahead and remove slave_parallel_workers from my.cnf and restart mysql.

Since you started MySQL up and master and slave and got error 1236, that means you are trying to establish replication from an impossible position. In the context of GTID and error message you got, the binary logs needed to fully identify a set of queries within a GTID set no longer exists,

Look back at your SHOW SLAVE STATUS\G

Retrieved_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:4757140-5030085
 Executed_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:1-4783274

From this, the last GTID executed is 7846a847-62c7-11e5-91a6-e06995de432e:4783274

This means that the binary log that has or had 7846a847-62c7-11e5-91a6-e06995de432e:4783275 no longer exists.

I can see this happening if you stopped replication on the Slave, left replication off long enough for the Master to rotate its binary logs (via expire_logs_days) the slave still needed to see, then turned on replication.

In your particular case, try doing a mysqlbinlog dump of the binary log mysqld-bin.000141. If nothing comes out of it, you will have to reload the Slave and setup replication from scratch.

Best Answer

SCENARIO #1

SCENARIO #2

SCENARIO #3

Related Solutions

MySQL slave replicates changes that are in neither binlog_do_db nor replicate_do_db

Thesql 5.6 gtid replication slave stuck (system lock)

SUGGESTION

Related Question