MySQL Master–Slave replication stops after 20 minutes with no errors

Tags: mariadb, mysql, replication

I'm using MariaDB 10.4 in a standard master–slave setup.

Everything works fine for about 20 minutes and I can see changes replicating across, but then replication just stops.

However, if I restart the slave MySQL instance, replication begins again and catches up.

Restarting the master has no effect; replication stays broken.

When it's in the broken, non-replicating state, this is the output from the slave and the master.

[Screenshots of the slave's and the master's status output]
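
For reference, the screenshots came from SHOW SLAVE STATUS\G on the slave and SHOW MASTER STATUS on the master; the fields worth comparing are noted below.

    -- On the slave:
    SHOW SLAVE STATUS\G
    -- Key fields: Slave_IO_Running, Slave_SQL_Running,
    -- Master_Log_File / Read_Master_Log_Pos, Seconds_Behind_Master.

    -- On the master:
    SHOW MASTER STATUS;
    -- If the slave's Read_Master_Log_Pos falls behind the master's Position
    -- while Seconds_Behind_Master still reads 0, the slave has stopped
    -- reading events without noticing the connection is gone.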

So it looks like it thinks it's caught up, but it hasn't, no matter how long I leave it.

I've used netcat to test the connection, so I know the slave can still see the master; plus, when I restart MySQL it works right away.

The my.cnf config has not been changed apart from adding the server ID, and this is how I configured the slave initially:

    CHANGE MASTER TO
      MASTER_HOST='dynamicip.xxxxxxx.org',
      MASTER_PORT=4302,
      MASTER_USER='remote_replication',
      MASTER_PASSWORD='xxxxxxxxxx',
      MASTER_LOG_FILE='mysql-bin.000006',
      MASTER_LOG_POS=328;

The error logs contain no errors either.

Where would I begin to troubleshoot this?

EDIT / UPDATE

So it appears the data does replicate over eventually, but it takes at least 50 minutes, and I'm testing it with tiny bits of data.

I've switched the hostname dynamic.****.org to an IP address in case Cloudflare was doing something weird, but that hasn't fixed it. I also tried skip-host-cache before that.

  • I've also switched innodb_flush_log_at_trx_commit from 0 to 1.
  • Traceroute looks fine.
  • I can see the statements in the binlog file on the master (command below).
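
For reference, the binlog contents can be confirmed from the client as well; the log file name below is the one from my CHANGE MASTER TO statement, so check SHOW MASTER STATUS first in case the log has rotated to a newer file.

    -- On the master: list recent events in the binlog
    SHOW BINLOG EVENTS IN 'mysql-bin.000006' LIMIT 20;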

Interestingly, using STOP SLAVE and START SLAVE kicks it back into action for a while before the issue starts again.
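
The exact statements, run on the slave:

    STOP SLAVE;
    START SLAVE;

    -- Since it's the I/O thread that seems to die, restarting just that
    -- thread may also be enough (untested here):
    STOP SLAVE IO_THREAD;
    START SLAVE IO_THREAD;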

SHOW PROCESSLIST on the master shows that the I/O thread is sometimes present and sometimes not; I'm not sure what that means.
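
For anyone checking the same thing: on the master, the slave's I/O thread shows up as a connection from the replication user running the Binlog Dump command, so that row is what appears and disappears.

    -- On the master:
    SHOW PROCESSLIST;
    -- or filtered down to just the replication connection:
    SELECT ID, USER, HOST, COMMAND, TIME, STATE
    FROM information_schema.PROCESSLIST
    WHERE COMMAND LIKE 'Binlog Dump%'
       OR USER = 'remote_replication';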

Best Answer

The first thing I would do is check the network between the Master and the Slave. Why?

Almost 8 years ago, I wrote this post: Slave I/O thread dies very often

I once had an occasion where I saw a Slave with SHOW SLAVE STATUS\G looking exactly as you have it, but it was not replicating. This is what actually happened:

  • The firewall on the Slave would become intermittent
  • The Master could not talk back to the Slave through the I/O thread
  • SHOW PROCESSLIST on the Master revealed that the I/O thread was no longer there
  • The Slave assumed the Master was OK (more on this below)
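
That last point is the crux: the Slave only declares the connection dead after slave_net_timeout seconds of silence, so a silently dropped connection can look healthy for a long time. A minimal sketch of the knobs to inspect and tighten on the Slave (the values are examples, not tuned recommendations):

    -- How long the Slave waits before deciding the Master has gone quiet
    SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';

    -- Shorten it so a dead link is noticed within about a minute
    SET GLOBAL slave_net_timeout = 60;

    -- Have the Master send heartbeats across an idle link
    STOP SLAVE;
    CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 30;
    START SLAVE;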

What I recommend is this:

  • Start troubleshooting with a traceroute to dynamic....org
  • See if DNS is acting up. If DNS is involved, add skip-name-resolve and skip-host-cache to my.cnf on both the Master and the Slave and restart mysqld; a quick check follows below
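
Both options go under the [mysqld] section. skip_name_resolve is read-only at runtime, so it only takes effect through my.cnf plus a restart; afterwards, confirm it stuck:

    -- On both Master and Slave, after restarting mysqld:
    SHOW GLOBAL VARIABLES LIKE 'skip_name_resolve';
    -- Expect: Value = ON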

Please let us know what you found in your troubleshooting.