Mysql – Programatically testing slave status – What fields in SHOW SLAVE STATUS indicated slave status is OK / in ERROR

MySQLreplicationtesting

We use pingdom and a few internal heartbeat mechanisms on our product.

We would like to add a 'are our slaves running and in good shape' check.

What fields in SHOW SLAVE STATUS indicate that the slave where SHOW SLAVE STATUS was executed on is not (or is) replicating correctly?

The 2 we are currently testing are:

Slave_IO_Running (must equal Yes)

Slave_SQL_Running (must equal Yes)

Seconds_Behind_Master (must be integer, and less then 10 seconds behind master)

Any others we should be testing?

Thanks again!

Best Answer

SCENARIO #1

You may want to monitor Relay_Log_Space. It represents the some total of relay logs the Slave has to process. When should you worry about it?

Give this scenario

  • Slave_IO_Running is Yes
  • Slave_SQL_Running is No

This indicates that that the IO Thread is still downloading entries from its Master's binary logs. The longer you take to address the SQL error, the more the Relay_Log_Space will pile up. If the Master is in a heavy-write environment, relay logs will pile up quickly and puts the Slave at risk for disk space issues.

If you do not want to monitor Relay_Log_Space, you can always set the relay_log_space_limit. This force a cap on the amount of disk space allow for relay logs. For example, if you set this:

[mysqld]
relay_log_space_limit=8G

and restart mysql, this causes the IO Thread to stop collecting new entries from the Master when the sum of all relay log file sizes >= 8GB.. After the SQL Thread is done processing the oldest relay log, that relay log gets rotated out and the IO Thread starts getting entries from the Master where it left off. This provides a decent throttling mechanism for relay logs.

Another field you may want to look at is Last_SQL_Errno. Once Slave_IO_Running is Yes and Slave_SQL_Running is No, the Last_SQL_Errno will be set to something nonzero. On very rare occasions, it is possible to have this:

  • Slave_IO_Running is Yes
  • Slave_SQL_Running is No
  • Last_SQL_Errno is 0

When you have this unique phenomenon, mysqld has encountered an undocumented error (Don't laugh, I have seen this).

SCENARIO #2

Something else to keep in mind (not necessarily for monitoring)

Give this scenario

  • Slave_IO_Running is No
  • Slave_SQL_Running is No

Under what three circumstances would this occur?

  1. Someone ran STOP SLAVE; by hand or in a backup script
  2. The option skip-slave-start is in /etc/my.cnf and mysql was started
  3. Issuing service mysql start --skip-slave-start or service mysql restart --skip-slave-start

Make sure not to monitor false positives should both replication threads are not running.

SCENARIO #3

Give this scenario

  • Slave_IO_Running is No
  • Slave_SQL_Running is Yes

This quickly indicates that there is a communication error because of one of the following

  1. The Master went down
  2. mysqld on the Master went down
  3. Network Intermittency