MySQL 5.6 replication lag fluctuating between 0 and n

MySQL, mysql-5.6, replication

I have one master and 7 slaves. During high load on the master, I see replication lag that keeps fluctuating between 0 and n, where n keeps increasing over time (I have seen it grow to more than 1 hour). The fluctuations happen within seconds, e.g. sec 1: lag 0s, sec 2: lag 2000s, sec 3: lag 0s, sec 4: lag 2002s.
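To make the pattern concrete, here is a minimal Python sketch that summarises a series of once-per-second Seconds_Behind_Master samples. The function name `summarize_lag` is made up for illustration, and the sample values are the ones quoted above; how the samples are collected is left out.

```python
def summarize_lag(samples):
    """Summarise Seconds_Behind_Master readings taken at a fixed interval.

    `samples` is a list of ints; None stands for a NULL reading and is skipped.
    """
    readings = [s for s in samples if s is not None]
    nonzero = [s for s in readings if s > 0]
    # A "flip" is a change between a zero and a non-zero reading.
    flips = sum(
        1 for a, b in zip(readings, readings[1:]) if (a == 0) != (b == 0)
    )
    return {
        "zero_readings": len(readings) - len(nonzero),
        "peak_lag": max(nonzero, default=0),
        "flips": flips,
    }


if __name__ == "__main__":
    # The four samples quoted in the question.
    print(summarize_lag([0, 2000, 0, 2002]))
```

A high flip count together with a growing peak is the flapping described here, as opposed to a lag that rises or falls steadily.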

  1. When Seconds_Behind_Master is 0, SHOW SLAVE STATUS\G says: "Slave has read all relay log; waiting for the slave I/O thread to update it".
  2. When Seconds_Behind_Master is n, SHOW SLAVE STATUS\G says either "Reading event from the relay log" or "System lock".
  3. On the master, SHOW PROCESSLIST always shows the replication thread in the state "Sending binlog event to slave".
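One way to check which thread is behind is to compare binlog coordinates rather than rely on Seconds_Behind_Master. The Python sketch below parses the text of SHOW SLAVE STATUS\G and compares it with the File and Position from SHOW MASTER STATUS on the master. The field names are the ones MySQL 5.6 prints; the helper names and the sample values in `SAMPLE` are made up for illustration, and the byte gaps are only computed when both sides are on the same binlog file.

```python
SAMPLE = """\
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
              Master_Log_File: mysql-bin.000042
          Read_Master_Log_Pos: 1000
        Relay_Master_Log_File: mysql-bin.000042
          Exec_Master_Log_Pos: 400
        Seconds_Behind_Master: 2000
"""  # made-up values, not taken from the question


def parse_slave_status(text):
    """Turn 'Key: value' lines of \\G output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # row separator lines contain no colon and are skipped
            fields[key.strip()] = value.strip()
    return fields


def replication_gaps(fields, master_file, master_pos):
    """Return byte gaps for the I/O thread and the SQL thread.

    io_gap_bytes:  master position minus what the I/O thread has read.
    sql_gap_bytes: what the I/O thread has read minus what the SQL thread applied.
    A gap is None when the two sides are on different binlog files.
    """
    io_file = fields["Master_Log_File"]
    sql_file = fields["Relay_Master_Log_File"]
    io_pos = int(fields["Read_Master_Log_Pos"])
    sql_pos = int(fields["Exec_Master_Log_Pos"])
    io_gap = master_pos - io_pos if io_file == master_file else None
    sql_gap = io_pos - sql_pos if sql_file == io_file else None
    return {"io_gap_bytes": io_gap, "sql_gap_bytes": sql_gap}


if __name__ == "__main__":
    status = parse_slave_status(SAMPLE)
    print(replication_gaps(status, "mysql-bin.000042", 5000))
```

If io_gap_bytes keeps growing under load while sql_gap_bytes stays near zero, the I/O thread is the one falling behind; the reverse points at the SQL thread.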

From the above points, I concluded that the SQL thread is not lagging and that the I/O thread is the culprit. I read that network slowness can cause this issue, but the network is not a bottleneck: I have verified that the bandwidth used between the master and the slaves is only 50%. When I turned on slave_compressed_protocol, network usage went down, but the replication lag still kept growing in the same fluctuating fashion.

I want to know what causes other than the network could lead to this issue. I have gone through https://www.percona.com/blog/2013/09/16/possible-reasons-when-mysql-replication-lag-is-flapping-between-0-and-xxxxx/ and couldn't attribute my lag to any of the points mentioned in that post.

Also, when the load on master stops, replication lag stops fluctuating and starts decreasing steadily from n and finally catches up.

Thanks.

Edit:

Could it be that, due to heavy load on the master (CPU utilisation is hitting 100%), the I/O thread intermittently has to wait to read from the binlogs?

Best Answer

Seconds_behind_master bouncing between 0 and some 'large' value?

I have seen that scenario for 16 years. I have never found the cause or cure. The problem usually goes away after a day or two.

Bottom line: Ignore it.