I've got a strange replication problem. For reference I am using MySQL 5.5 with Statement Based Replication. We have a Master with Slaves on site A / network A, and slaves on site B / network B.
Everything on network A is fine. The problem occurs with the databases on site B / network B that are connecting to the master at Site A.
On a few occasions now I have seen replication stop on the databases at site B. If I look at `Slave_IO_Running` and `Slave_SQL_Running` in `SHOW SLAVE STATUS`, they both say Yes. To all intents and purposes everything looks fine to me. However, my monitoring is reporting that the heartbeat event we run is falling behind. It's as if the slave was connected to the master, but not receiving any data.
The various position values (`Read_Master_Log_Pos`, `Relay_Log_Pos`, `Exec_Master_Log_Pos`) are all static and not moving. I also checked the relay log on one, and the incoming data just stops at that time.
If I look at both the Master and Slave databases, there are no long-running queries that would cause this. Everything appears to run as expected, and as mentioned the slaves at site A are all fine and keep up to date.
The databases in question are all running different queries so it's not a specific query upsetting things.
There is nothing in the error logs for MySQL.
We have simulated a short network failure (although from our monitoring we can't see any network outages around these times) and the databases work as expected. As soon as the network is reconnected they resume replication.
This is fixed by running `STOP SLAVE; START SLAVE;`, at which point everything carries on as if nothing had happened.
Has anyone else had a similar problem, or can anyone shed some light on what may be happening? My gut feeling is that there is a very brief network outage, too short for the monitoring to catch, but why this would upset MySQL I don't know.
Best Answer
What catches my attention is the word 'NETWORK'.
Communication between Master and Slave runs over a single long-lived connection. According to the MySQL Documentation on Replication, the Slave's I/O thread connects to the Master and requests the contents of its binary log; the Master's Binlog Dump thread streams the events back, and the I/O thread writes them to the relay log for the SQL thread to replay.
Given this description of the I/O aspect of Replication, what could you look for ???
FIREWALL
The connection between Master and Slave requires that the firewall be open. Unfortunately, I have seen occasions where the firewall was open on the Master and a Slave would connect as usual. The Slave would have the I/O thread show up in its processlist like nothing was wrong, and the Master would show the corresponding Binlog Dump thread. All of a sudden, 60 seconds later, the Binlog Dump thread disappears from the Master's processlist, while the I/O thread remains visible on the Slave.
Given that scenario (which I witnessed first-hand between two Amazon EC2 servers in two different AZs (Availability Zones)), the solution back then was to check the Security Groups and open port 3306 in the Slave's AZ.
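A quick way to compare both sides during one of these stalls (a sketch; the thread and state names are what MySQL 5.5 reports):

```sql
-- On the Master: a healthy slave shows up as a "Binlog Dump" thread
SHOW PROCESSLIST;

-- On the Slave: check the I/O thread's view of the connection.
-- Slave_IO_State "Waiting for master to send event" can appear even
-- when the Master no longer has a matching Binlog Dump thread.
SHOW SLAVE STATUS\G
```

If the Binlog Dump thread is missing on the Master while the Slave still claims to be connected, you are looking at exactly the half-dead connection described above.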
TIMEOUT
MySQL has settings for timing out network connections. The most relevant one here is slave_net_timeout: the number of seconds the Slave waits for more data from the Master before it considers the connection broken, aborts the read, and tries to reconnect (60 seconds by default in 5.5). net_read_timeout and net_write_timeout govern ordinary client connections the same way.
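You can inspect what the Slave is currently using, and optionally tighten the value, like this (a sketch; 30 seconds is just an example value):

```sql
-- On the Slave: inspect the relevant network timeouts
SHOW GLOBAL VARIABLES
 WHERE Variable_name IN ('slave_net_timeout', 'net_read_timeout', 'net_write_timeout');

-- Optionally lower slave_net_timeout so a dead connection is noticed sooner.
-- Takes effect the next time the I/O thread (re)connects.
SET GLOBAL slave_net_timeout = 30;
```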
CONCERNS
Why talk about the network like this ??? You can be victimized in the form of data drift. Back on Jun 17, 2014, I answered the post "I have been tasked with Mysql Master-Master replication?", where I briefly mentioned the network as an unsung hero in data drift.
YOUR ACTUAL QUESTION
You running `STOP SLAVE;` and `START SLAVE;` does not find the root cause, but it does indeed solve the problem at hand. How ??? All it does is disconnect both the I/O and SQL threads and then reconnect from scratch. You could also have run `STOP SLAVE IO_THREAD;` followed by `START SLAVE IO_THREAD;`, which would also do just fine, especially if the SQL thread is busy and you do not want to interrupt it.
You will need to check the connection between the Master and Slave for dropped packets.
If your monitoring polls at the same time granularity as the MySQL timeout values, a brief outage can fall between samples and nothing will alert you when it happens. You would have to poll MySQL more frequently. As an alternative, you could set up some kind of SNMP monitoring so that if the SNMP data stops updating in a timely manner, you can detect MySQL being down or not responding without ever connecting to MySQL.
My answer may not have fully defined the root cause, but I do have two suggestions:
SUGGESTION #1
Look into your max_allowed_packet setting. Many times on DBA StackExchange I have affectionately called the MySQL packet the Silent Killer of DB Connections. The I/O thread is just as much a DB connection as any other. I would make sure max_allowed_packet is always set to 1073741824 (which is 1G) on both Master and Slave, e.g. `SET GLOBAL max_allowed_packet = 1073741824;` (and in my.cnf so it survives a restart).
SUGGESTION #2
You could manually set the I/O Thread Heartbeat. How ?
According to the MySQL 5.5 documentation for `CHANGE MASTER TO`, the MASTER_HEARTBEAT_PERIOD option controls how often the Master sends a heartbeat to the Slave when there is nothing left to replicate, and it defaults to half of slave_net_timeout. Based on that and the default value for slave_net_timeout (60 seconds), the I/O thread should receive a heartbeat every 30 seconds. You could change the heartbeat period to 10 seconds like this:
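On the Slave (a sketch; `CHANGE MASTER TO` requires the replication threads to be stopped first, and I am assuming 5.5's heartbeat status counters are available):

```sql
STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 10;
START SLAVE;

-- Verify the new period and watch heartbeats arrive over time
SHOW STATUS LIKE 'Slave_heartbeat_period';
SHOW STATUS LIKE 'Slave_received_heartbeats';
```

With a 10-second heartbeat, a connection that has silently died will stop producing heartbeat events much sooner, so your lag monitoring catches the stall well before the default would.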