MySQL 5.6 Replication Lag Fluctuating Between 0 and n

MySQL, mysql-5.6, replication

I have one master and 7 slaves. During high load on the master, I see replication lag that keeps fluctuating between 0 and n, where n keeps increasing with time (I have seen n grow to more than 1 hour). The fluctuations happen within seconds, e.g. sec 1: lag 0s, sec 2: lag 2000s, sec 3: lag 0s, sec 4: lag 2002s.

When Seconds_behind_master is 0, show slave status\G says: "Slave has read all relay log; waiting for the slave I/O thread to update it".
When Seconds_behind_master is n, show slave status\G says either "Reading event from the relay log" or "System lock".
On the master, "show processlist" always shows the replication thread in the state "Sending binlog event to slave".
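To watch these fields over time instead of eyeballing them, the vertical output can be captured and parsed. The following is a minimal sketch, not a monitoring tool: the function names and the SAMPLE text are made up for illustration, and only the field names Seconds_Behind_Master, Slave_IO_State and Slave_SQL_Running_State are taken from the real SHOW SLAVE STATUS output of MySQL 5.6.

```python
def parse_vertical_status(text):
    # Turn the "Key: value" lines of a vertical (\G) status dump into a dict.
    fields = {}
    for line in text.splitlines():
        # Skip row separators such as "**** 1. row ****" and blank lines.
        if line.lstrip().startswith("*") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    # Pick out the three fields discussed above; lag is None when MySQL reports NULL.
    lag = fields.get("Seconds_Behind_Master")
    return {
        "lag_seconds": None if lag in (None, "NULL") else int(lag),
        "io_state": fields.get("Slave_IO_State", ""),
        "sql_state": fields.get("Slave_SQL_Running_State", ""),
    }


# Hand-written sample for illustration only, not captured from a real server.
SAMPLE = """\
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
        Seconds_Behind_Master: 2000
      Slave_SQL_Running_State: Reading event from the relay log
"""

print(summarize(parse_vertical_status(SAMPLE)))
```

Feeding it one dump per second would give a lag/state series in which the 0 readings can be matched against the "Slave has read all relay log" state.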

From the above, I have concluded that my SQL thread is not lagging and that the I/O thread is the culprit. I read that network slowness can cause this, but the network is not the bottleneck: I have verified that only 50% of the bandwidth between master and slaves is in use. When I turned on slave_compressed_protocol, network usage went down, but I was still seeing the replication lag grow in a fluctuating fashion.
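One way to see how a slow I/O thread alone could produce this pattern is a toy model. It relies on the documented behaviour that Seconds_Behind_Master reads 0 whenever the SQL thread has caught up with the I/O thread, even if the I/O thread itself is behind the master. Everything else here is an assumption made for illustration: one event per second on the master, an I/O thread that fetches one event every io_period seconds, and a SQL thread that applies events instantly.

```python
def simulate_lag(ticks, io_period=2):
    # Toy model (assumed rates): the master logs one event per tick, stamped
    # with the tick number; the I/O thread fetches one event every io_period
    # ticks; the SQL thread applies each fetched event within the same tick.
    readings = []
    next_event_ts = 0  # master timestamp of the next event to be fetched
    for now in range(ticks):
        if now % io_period == 0:
            # An event just arrived: the SQL thread is executing it, so the
            # reported lag is "now" minus the event's master timestamp.
            readings.append(now - next_event_ts)
            next_event_ts += 1
        else:
            # Relay log is drained: the SQL thread has caught up with the
            # I/O thread, so the reported lag is 0.
            readings.append(0)
    return readings


print(simulate_lag(8))
```

In this model every other reading is 0 and the non-zero readings keep growing, which is the same shape as the lag I am seeing, although it does not say why the I/O thread would be slow.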

I want to know what causes other than the network could produce this behaviour. I have gone through https://www.percona.com/blog/2013/09/16/possible-reasons-when-mysql-replication-lag-is-flapping-between-0-and-xxxxx/ and couldn't attribute my lag to any of the points mentioned in the post.

Also, when the load on master stops, replication lag stops fluctuating and starts decreasing steadily from n and finally catches up.

Thanks.

Edit:

Can it happen that, due to heavy load on the master (CPU utilisation is hitting 100%), the I/O thread is intermittently left waiting to read from the binlogs?

Best Answer

Seconds_behind_master bouncing between 0 and some 'large' value?

I have seen that scenario for 16 years. I have never found the cause or cure. The problem usually goes away after a day or two.

Bottom line: Ignore it.

Related Solutions

MySQL Slave lag in SHOW SLAVE STATUS does not match SHOW PROCESSLIST

The "Time" in the SQL thread is (I think) identical to Seconds_behind_master. It is "How long ago did this query start on the Master".

All other Times indicate when the query started on the Slave.

Some fluctuation is caused by what it is measuring (the Master's start time).

Sometimes (rarely), I see the value (both places) bouncing between 0 and some large value. I have yet to track this down. I have seen it on 4.0, 4.1, and 5.1. It eventually goes away, and becomes civilized.

There may be cases where no traffic leads to strange values. But I don't have any Master-Slave setups with little enough traffic for me to comment.

Suppose you run an ALTER on the Master and it takes 1 hour (3600 seconds). Also, suppose not much else is going on. The ALTER replicates and starts running. Immediately, Seconds_behind_master will be about 3600. After the ALTER finishes on the Slave (say, 3600 more seconds later), subsequent replication items will execute with (probably) smaller Times. Eventually replication catches up.

Is MySQL Replication Affected by a High-Latency Interconnect

The direct answer to your question is Yes, but it depends on the version of MySQL you are running. Before MySQL 5.5, replication would operate as follows:

Master Executes SQL
Master Records SQL Event in its Binary Logs
Slave Reads SQL Event from Master Binary Logs
Slave Stores SQL Event in its Relay Logs via I/O Thread
Slave Reads Next SQL Event From Relay Log via SQL Thread
Slave Executes SQL
Slave Acknowledges Master of the Complete Execution of the SQL Event

As of MySQL 5.5, with Semisynchronous Replication enabled, replication operates as follows:

Master Executes SQL
Master Records SQL Event in its Binary Logs
Slave Reads SQL Event from Master Binary Logs
Slave Acknowledges Master of the Receipt of the SQL Event
Slave Stores SQL Event in its Relay Logs via I/O Thread
Slave Reads Next SQL Event From Relay Log via SQL Thread
Slave Executes SQL
Slave Acknowledges Master of the Complete Execution of the SQL Event

This new paradigm permits a Slave to stay more closely synced with its Master.

Notwithstanding, latency within the network could hamper MySQL Semisync Replication to the point where it reverts to the old-style asynchronous replication. Why? If a timeout occurs without any slave having acknowledged the transaction, the master reverts to asynchronous replication. When at least one semisynchronous slave catches up, the master returns to semisynchronous replication.
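The fallback described above can be sketched as a small state machine. This is a toy model under stated assumptions, not how the plugin is implemented: simulate_master is a made-up name, each commit is given a hypothetical acknowledgement delay in milliseconds, and "the slave has caught up" is approximated as "an acknowledgement arrived within the timeout". The 5000 ms default mirrors the rpl_semi_sync_master_timeout value used in the configuration in this answer.

```python
def simulate_master(ack_delays_ms, timeout_ms=5000):
    # Toy model: returns (mode after the commit, time the commit waited in ms)
    # for each commit, given the slave's acknowledgement delay per commit.
    mode = "semisync"
    log = []
    for delay in ack_delays_ms:
        if mode == "semisync":
            if delay <= timeout_ms:
                waited = delay           # ack arrived in time
            else:
                waited = timeout_ms      # gave up waiting after the timeout
                mode = "async"           # master reverts to asynchronous
        else:
            waited = 0                   # async commits do not wait for acks
            if delay <= timeout_ms:
                mode = "semisync"        # assumed: slave has caught up again
        log.append((mode, waited))
    return log


for entry in simulate_master([20, 8000, 9000, 30, 25]):
    print(entry)
```

With a high-latency link, the delays in the middle of the list push the master into asynchronous mode until acknowledgements start arriving within the timeout again.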

UPDATE 2011-08-08 14:22 EDT

The configuration of MySQL 5.5 Semisynchronous Replication is straightforward:

Step 1) Add these four lines to the [mysqld] section of /etc/my.cnf

[mysqld]
plugin-dir=/usr/lib64/mysql/plugin
#rpl_semi_sync_master_enabled
#rpl_semi_sync_master_timeout=5000
#rpl_semi_sync_slave_enabled

Step 2) Restart MySQL

service mysql restart

Step 3) Run these commands in the MySQL client

INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
INSTALL PLUGIN rpl_semi_sync_slave  SONAME 'semisync_slave.so';

Step 4) Uncomment the three rpl_semi_sync options after the plugin-dir option

[mysqld]
plugin-dir=/usr/lib64/mysql/plugin
rpl_semi_sync_master_enabled
rpl_semi_sync_master_timeout=5000
rpl_semi_sync_slave_enabled

Step 5) Restart MySQL

service mysql restart

All done! Now just set up MySQL Replication as usual.
