MySQL Replication not proceeding

mysql, mysql-5.5, percona-server, replication

I have a weird replication problem I have not seen before. It's a basic MySQL replication setup with a single master and two slaves. One of the slaves is not executing replication events and seems to have been stuck since some point yesterday. The other is current.

  • Running show slave status on the problematic slave shows no increases in any of the counters.
  • It lists both the IO and SQL threads as running.
  • Seconds_Behind_Master reports 0.
  • None of the log counters are increasing.
  • No errors are reported. Running stop/start slave returns no errors. Bouncing the server reports nothing out of the ordinary in the .err log, and it says it's picking up replication from the relay log position it's stuck on.
  • The master shows the slave as connected.
  • The slave shows two system user replication threads reporting "Waiting for master to send event" and "Slave has read all relay log; waiting for the slave I/O thread to update it". Their Time counters in the process list are just steadily increasing.
  • Attempts to connect from the slave to the master with the replication credentials via the command-line client work fine.
  • There is plenty of disk space in both the datadir and logdir.
  • The Master_log_file it's reporting still exists on the master, according to both show binary logs and the actual filesystem (it wasn't pruned or manually deleted).
  • The master and both slaves are running the same Percona build (5.5.29-30.0-log) and have been for many months.
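The symptoms above (both threads running, Seconds_Behind_Master at 0, counters frozen) can be cross-checked by comparing the slave's read coordinates with the master's current binary log position. The following is only a rough sketch in Python, assuming the rows from SHOW MASTER STATUS and SHOW SLAVE STATUS have already been fetched into dicts keyed by column name. The function names and the sample file names are made up for illustration, and the check assumes binary log names end in a numeric sequence.

```python
def binlog_coords(file_name, position):
    """Turn ('mysql-bin.000042', 98765) into a sortable (42, 98765) tuple.

    Assumes the binary log name ends in '.<sequence number>'.
    """
    sequence = int(file_name.rsplit(".", 1)[1])
    return (sequence, int(position))


def read_coords(slave_status):
    """Coordinates in the master's binlog that the slave I/O thread has read up to."""
    return binlog_coords(
        slave_status["Master_Log_File"],
        slave_status["Read_Master_Log_Pos"],
    )


def looks_silently_stalled(master_status, slave_earlier, slave_later):
    """Hypothetical check: two SHOW SLAVE STATUS samples taken some time apart.

    Returns True when both threads claim to be running, the slave's read
    position did not move between the samples, and the master is ahead.
    """
    threads_ok = (
        slave_later["Slave_IO_Running"] == "Yes"
        and slave_later["Slave_SQL_Running"] == "Yes"
    )
    not_moving = read_coords(slave_earlier) == read_coords(slave_later)
    master_now = binlog_coords(master_status["File"], master_status["Position"])
    behind = read_coords(slave_later) < master_now
    return threads_ok and not_moving and behind
```

If this returns True while the slave reports no error, the slave is not receiving events even though it believes it is connected.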

I'm at a loss on what to further troubleshoot. Help?

Best Answer

This turned out not to be a MySQL issue at all. The network team had recently installed a new security device that blocks packets matching certain rules. A legitimate database write contained a sequence of characters the device deemed nefarious.

The connection handshake for replication went through fine, but then the slave just sat there asking the master for the next log entry, whose packets never made it back.

As far as the broken slave was concerned, it was up to date because it had executed the most recent event in the relay logs it had received.
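Because the slave's own counters cannot reveal this kind of stall, one option is to measure lag from data that has actually replicated. A common approach (tools such as pt-heartbeat work along these lines) is to update a timestamp row on the master at a regular interval and read it back on the slave. Below is a minimal Python sketch of the comparison step only. It assumes the replicated timestamp has already been read from the slave and that clocks are reasonably in sync. The function names and the 60-second threshold are illustrative choices, not part of any tool.

```python
from datetime import datetime, timedelta


def heartbeat_lag(last_replicated_heartbeat, now):
    """How far the newest heartbeat seen on the slave trails the current time."""
    # Clamp at zero so small clock skew does not produce a negative lag.
    return max(now - last_replicated_heartbeat, timedelta(0))


def replication_is_stale(last_replicated_heartbeat, now,
                         max_lag=timedelta(seconds=60)):
    """Hypothetical alert condition: heartbeat older than the allowed lag."""
    return heartbeat_lag(last_replicated_heartbeat, now) > max_lag
```

In the situation described here, the heartbeat on the broken slave would have stopped advancing at the point the device began dropping packets, so this check would have flagged the slave even though Seconds_Behind_Master reported 0.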