MySQL Replication Stops, But Still Says it is Running

Tags: mysql, mysql-5.5, replication

I've got a strange replication problem. For reference, I am using MySQL 5.5 with statement-based replication. We have a Master with Slaves on site A / network A, and slaves on site B / network B.

Everything on network A is fine. The problem occurs with the databases on site B / network B that are connecting to the master at Site A.

On a few occasions now I have seen replication stop on the databases at site B. If I look at Slave_IO_Running and Slave_SQL_Running in SHOW SLAVE STATUS, they both say Yes. To all intents and purposes everything looks fine to me. However, my monitoring is reporting that the heartbeat event we run is falling behind. It's as if the slave were connected to the master but not receiving any data.

The various position values (Read_Master_Log_Pos, Relay_Log_Pos, Exec_Master_Log_Pos) are all static and not moving. I also checked the relay log on one of the slaves, and the incoming data just stops at that point.
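
For reference, the check looks roughly like this (run twice, a short while apart, and compare):

-- On the affected slave
SHOW SLAVE STATUS\G
-- Fields of interest: Slave_IO_Running, Slave_SQL_Running,
-- Read_Master_Log_Pos, Relay_Log_Pos, Exec_Master_Log_Pos.
-- Both threads report Yes, yet none of the positions advance
-- even though the master is taking writes.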

If I look at both the Master and Slave Databases there are no long running queries that would cause this. Everything appears to run as expected, and as mentioned the slaves at site A are all fine and keep up to date.

The databases in question are all running different queries so it's not a specific query upsetting things.

There is nothing in the error logs for MySQL.

We have simulated a short network failure (although from our monitoring we can't see any network outages around these times) and the databases work as expected. As soon as the network is reconnected they resume replication.

This is fixed by running STOP SLAVE; START SLAVE; at which point everything carries on as if nothing had happened.

Has anyone else had a similar problem, or can anyone shed some light on what may be happening? My gut feeling is that there is a very brief network outage, too short for the monitoring to catch, but why this would upset MySQL I don't know.

Best Answer

What catches my attention is the word 'NETWORK'.

Communication between Master and Slave is implemented as bidirectional.

According to the MySQL Documentation on Replication

Binlog dump thread. The master creates a thread to send the binary log contents to a slave when the slave connects. This thread can be identified in the output of SHOW PROCESSLIST on the master as the Binlog Dump thread.

The binary log dump thread acquires a lock on the master's binary log for reading each event that is to be sent to the slave. As soon as the event has been read, the lock is released, even before the event is sent to the slave.

Slave I/O thread. When a START SLAVE statement is issued on a slave server, the slave creates an I/O thread, which connects to the master and asks it to send the updates recorded in its binary logs.

The slave I/O thread reads the updates that the master's Binlog Dump thread sends (see previous item) and copies them to local files that comprise the slave's relay log.

The state of this thread is shown as Slave_IO_running in the output of SHOW SLAVE STATUS or as Slave_running in the output of SHOW STATUS.

Given this description of the I/O aspect of Replication, what could you look for ???

FIREWALL

The connection between Master and Slave requires that the firewall be open. Unfortunately, I have seen occasions where the firewall was open on the Master and a Slave would connect as usual. The Slave would have the I/O thread show up in the processlist like nothing was wrong. The Master would do the same. All of a sudden, 60 seconds later, the I/O thread disappears from the processlist of the Master, but remains visible on the Slave.

Given that scenario (which I witnessed first-hand between two Amazon EC2 servers in two different AZs (Availability Zones)), the solution back then was to check the Security Groups and get port 3306 opened up to the Slave's AZ.
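
If you suspect something similar here, compare both sides while the slave looks stalled. A rough sketch (the information_schema query is just one way to filter the processlist):

-- On the Master: is there still a Binlog Dump thread serving this slave ?
SELECT ID, USER, HOST, COMMAND, TIME, STATE
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Binlog Dump';

-- On the Slave: does the I/O thread still claim to be connected ?
SHOW SLAVE STATUS\G
-- If the Slave shows Slave_IO_Running: Yes but the Master has no
-- Binlog Dump thread, the connection has silently died.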

TIMEOUT

MySQL has several settings for timing out network connections.

From the MySQL Docs:

  • rpl_semi_sync_master_timeout : A value in milliseconds that controls how long the master waits on a commit for acknowledgment from a slave before timing out and reverting to asynchronous replication. The default value is 10000 (10 seconds). This variable is available only if the master-side semisynchronous replication plugin is installed.
  • net_read_timeout : The number of seconds to wait for more data from a connection before aborting the read. When the server is reading from the client, net_read_timeout is the timeout value controlling when to abort. When the server is writing to the client, net_write_timeout is the timeout value controlling when to abort. See also slave_net_timeout.
  • net_write_timeout : The number of seconds to wait for a block to be written to a connection before aborting the write. See also net_read_timeout.
  • slave_net_timeout : The number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect. The first retry occurs immediately after the timeout. The interval between retries is controlled by the MASTER_CONNECT_RETRY option for the CHANGE MASTER TO statement, and the number of reconnection attempts is limited by the --master-retry-count option. Prior to MySQL 5.7.7, the default was 3600 seconds (one hour). In MySQL 5.7.7 and later the default is 60 seconds (one minute).
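
As a quick sanity check (a sketch, nothing more), you can see what the slave is actually using and, if need be, tighten slave_net_timeout so a dead connection is noticed sooner:

-- On the Slave
SHOW GLOBAL VARIABLES LIKE '%net%timeout%';
-- Example: have the I/O thread give up and reconnect after 60 seconds of silence
SET GLOBAL slave_net_timeout = 60;
-- Persist the value in my.cnf as well, or it is lost on restart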

CONCERNS

Why talk about the network like this ??? You can be victimized in the form of data drift. Back on Jun 17, 2014, I answered the post I have been tasked with Mysql Master-Master replication?. I briefly mentioned the network as an unsung culprit behind data drift:

  • Networking
    • You must make sure the network is not saturated, your switches are fine, any crossover cabling used to bypass the network is not defective, and there are no dropped packets.
    • In terms of MySQL, the Replication I/O thread is dependent on the network. You must make sure mysqld can regularly heartbeat its master. Perhaps semisync replication could be installed and configured for this.
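
If you want to go the semisync route, the MySQL 5.5 plugins are installed and enabled roughly like this (a sketch; the standard plugin file names are semisync_master.so and semisync_slave.so, adjust for your platform):

-- On the Master
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 10000;  -- milliseconds

-- On each Slave
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
-- Restart the I/O thread so the slave registers as semisync
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;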

YOUR ACTUAL QUESTION

Running STOP SLAVE; and START SLAVE; does not find the root cause, but it does indeed solve the problem at hand. How ??? All it does is disconnect both the I/O and SQL threads and then reconnect from scratch.

You could also have done

STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;

which would also do just fine, especially if the SQL thread is busy and you do not want to interrupt it.

You will need to check the connection between the Master and Slave for dropped packets.

If your monitoring has the same time granularity as the MySQL timeout values, you have nothing to alert you when it does happen. You would have to poll MySQL more frequently. As an alternative, you could probably create some kind of SNMP setup to monitor MySQL, so that if the SNMP info does not update in a timely manner you can detect MySQL being down or not responding without ever connecting to MySQL.
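
Since you already have a heartbeat event, polling it more aggressively is the cheapest option. A sketch, assuming a hypothetical heartbeat table with a single timestamp column named ts (adapt the names to whatever your heartbeat event actually writes):

-- Hypothetical heartbeat table written to by the heartbeat event
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS seconds_behind
FROM monitoring.heartbeat;
-- Alert as soon as seconds_behind exceeds a couple of heartbeat intervals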

My answer may not have fully defined the root cause, but I do have two suggestions:

SUGGESTION #1

Look into your max_allowed_packet setting. Many times on DBA StackExchange I have affectionately called the MySQL packet the silent killer of DB connections. The I/O thread is just as much a DB connection as any other. I would make sure max_allowed_packet is always set to 1073741824 (which is 1G).
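
A sketch of checking and raising it (set it on both the Master and the Slaves, and persist it in my.cnf, since SET GLOBAL only affects new connections):

-- On the Master and the Slaves
SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';
SET GLOBAL max_allowed_packet = 1073741824;  -- 1G
-- Run STOP SLAVE; START SLAVE; afterwards so the replication
-- threads reconnect and pick up the new value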

SUGGESTION #2

You could manually set the I/O Thread Heartbeat. How ?

According to the MySQL 5.5 Documentation for CHANGE MASTER TO

MASTER_HEARTBEAT_PERIOD sets the interval in seconds between replication heartbeats. Whenever the master's binary log is updated with an event, the waiting period for the next heartbeat is reset. interval is a decimal value having the range 0 to 4294967 seconds and a resolution in milliseconds; the smallest nonzero value is 0.001. Heartbeats are sent by the master only if there are no unsent events in the binary log file for a period longer than interval.

Setting interval to 0 disables heartbeats altogether. The default value for interval is equal to the value of slave_net_timeout divided by 2.

Setting @@global.slave_net_timeout to a value less than that of the current heartbeat interval results in a warning being issued. The effect of issuing RESET SLAVE on the heartbeat interval is to reset it to the default value.

Based on these paragraphs and the MySQL 5.5 default value for slave_net_timeout (3600 seconds), the I/O thread would by default only heartbeat every 1800 seconds (30 minutes). You could change the heartbeat period to 10 seconds like this:

STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 10;
START SLAVE;
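
After the change, the Slave_heartbeat_period and Slave_received_heartbeats status variables should confirm the new period and show heartbeats arriving:

SHOW GLOBAL STATUS LIKE 'Slave_heartbeat_period';
SHOW GLOBAL STATUS LIKE 'Slave_received_heartbeats';
-- Slave_received_heartbeats should keep climbing while the master is idle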