MySQL 5.6 Slave Stops Processing Binlog on Master Restart

binlog, MySQL, mysql-5.6, mysqldump, replication

I have a pair of Debian 7 servers with MySQL 5.6.23. The slave server is a clone of the master (but the uuid has been regenerated on the slave).

For some reason, whenever MySQL is restarted on the Master via "service mysql restart," the Slave will continue reading the Master_Log_File and Read_Master_Log_Pos, but won't process the Relay_Master_Log_File.

Slave I/O and Slave SQL are both running, but Seconds_Behind_Master continues to increment.

The Slave_SQL_Running_State says "Waiting for Slave Worker to release partition."

Obviously, I've tried to stop and start the slave, restart the MySQL service on the slave and master, and restart the slave and master servers.

To fix the issue, I have to blank and recreate the slave. I do this via a script I've made to dump the master file and position, use mysqldump to pull the database onto the slave, and then start slaving from the recorded master file and position. The mysqldump flags are --skip-lock-tables --flush-logs --hex-blob --master-data=2 --single-transaction --comments --routines

Am I missing something? Any help is greatly appreciated.

Best Answer

Guess what? You have hit a nasty little bug (Replication stall with multi-threaded replication).

This bug was first reported in MySQL 5.6.17 on 20 Jun 2014 15:45

SUGGESTION #1

Look over the release notes from MySQL 5.6.18 through MySQL 5.6.22 to see whether it was resolved. If any of those release notes claim the problem was fixed, then someone missed a patch. I advise against downgrading to anything before MySQL 5.6.21 (see "Any known issues upgrading from MySQL 5.1.73 to 5.6.21?" under ISSUE #3 : Security Issues).

SUGGESTION #2

Don't use multithreaded replication until this bug is fixed.

SUGGESTION #3

In the aforementioned bug report, it says this somewhere in the middle

When MySQL comes back up again, the log is rotated, and IO thread starts writing from the event group which was partially written.

. . .

You can see that the "end_log_pos" values are the same for the partial event I showed earlier and this one. It is the same UPDATE transaction that was partially written.

Now comes the interesting part. When the coordinator reads the relay log, it sends the partial event to the worker, since the event is partial hence the worker never commits the transaction and the transaction is kept open. The coordinator reads the next event (which is the full version of the partial event) but the coordinator cannot assign the next event to another worker because of one of the workers having an open transaction.

And apparently, MTS waits for workers to commit transactions when it sees log rotated.

I would suggest not using --flush-logs with mysqldump, just in case the log rotation it forces suddenly splits a group commit and this bug then mishandles the incoming binlog events.
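As a minimal sketch of that suggestion, the snippet below rebuilds the mysqldump command line from the flags listed in the question with --flush-logs dropped. The function name build_dump_command and the --all-databases argument are assumptions for illustration only; the question does not say which databases the script dumps.

```python
import shlex

# Flags exactly as listed in the question.
QUESTION_FLAGS = [
    "--skip-lock-tables",
    "--flush-logs",
    "--hex-blob",
    "--master-data=2",
    "--single-transaction",
    "--comments",
    "--routines",
]


def build_dump_command(flags, extra_args=("--all-databases",)):
    """Return a mysqldump argv list with --flush-logs removed.

    extra_args is a hypothetical default; substitute the databases
    your own script actually dumps.
    """
    kept = [flag for flag in flags if flag != "--flush-logs"]
    return ["mysqldump", *kept, *extra_args]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run.
    print(shlex.join(build_dump_command(QUESTION_FLAGS)))
```

Keeping --master-data=2 means the dump still records the master file and position as a comment, so the rebuild script can keep reading them from the dump.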

Related Solutions

MySQL slave replicates changes that are in neither binlog_do_db nor replicate_do_db

Replication filtering isn't bulletproof. Because of how the filtering is implemented, the check is made against the default database at query runtime, not against the database named in the statement. The events responsible for your errors are generated when the default database is the my-database schema, as expected, but the query being executed is fully qualified: INSERT INTO phpmyadmin.pma_column_info...

Peter Zaitsev explains the scenario well in this post:

Filtered MySQL Replication
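To make the scenario concrete, here is a toy model of the database-level check for statement-based events. It is only a sketch: passes_do_db_filter is a made-up name, and the filter value {"my-database"} is an assumption based on the schema name mentioned above.

```python
def passes_do_db_filter(default_db, do_db):
    """Toy model of a statement-based do_db check.

    Only the default (USE'd) database is compared against the filter;
    the database named inside the statement is never inspected.
    """
    return default_db in do_db


# Hypothetical session: default database is my-database, but the
# statement itself writes to a fully qualified phpmyadmin table.
default_db = "my-database"
statement = "INSERT INTO phpmyadmin.pma_column_info ..."
do_db = {"my-database"}

# The statement passes the filter even though it targets phpmyadmin.
print(passes_do_db_filter(default_db, do_db))
```

The same statement issued with phpmyadmin as the default database would be filtered out, which is why the behavior looks inconsistent from the outside.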

MySQL 5.6 GTID replication slave stuck (system lock)

Since I see more than 2 system user entries in the processlist, I would assume you are using Multi-Threaded Replication (slave_parallel_workers > 1).

That looks like a bug

On Oct 29, 2014, David Moss wrote:

Thank you for your feedback. This issue was covered in bug 17326020 and the following was added to the MySQL 5.6.21 and 5.7.5 changelogs:

When the I/O thread reconnected to a master using GTIDs and multithreaded slaves while in the middle of a transaction, it failed to abort the transaction, leaving a partial transaction in the relay log, and then retrieving the same transaction again. This occurred when performing a rotation of the relay log. Now when reconnecting, the server checks before rotating the log in such cases, and waits first for any ongoing transaction to complete.

Therefore nothing new will be added to cover this bug and I'm closing it as fixed.

On Dec 10, 2014, Laurynas Biveinis wrote:

Problem:

With MTS, GTIDs and auto positioning enabled, when a worker applies a partial transaction left on relaylog by an IO thread reconnection, it will wait for the XID log event to commit the transaction.

Unfortunately, the SQL thread coordinator will reach the master's ROTATE event on the next relaylog file and will wait for all workers to finish their tasks before applying the ROTATE.

Analysis:

As the whole transaction is retrieved again by the IO thread after the reconnection, the slave must rollback the partial transaction once noticing this ROTATE from the master.

This bug reports the same issue already fixed by BUG#17326020, and the reported issue is not reproducible anymore. So, this patch is just adding a new test case.
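The wait described in the two comments above can be pictured with a toy model. This is not how the server is implemented: the event names and the find_stall function are simplifications invented for illustration, with one worker holding a transaction open from BEGIN until the XID event.

```python
def find_stall(relay_events):
    """Return the index of the ROTATE event the coordinator blocks on, or None.

    Toy model: a worker keeps a transaction open from BEGIN until XID
    (the commit), and the coordinator must wait for all workers to
    finish before it can apply a ROTATE.
    """
    open_txn = False
    for index, event in enumerate(relay_events):
        if event == "BEGIN":
            open_txn = True
        elif event == "XID":
            open_txn = False
        elif event == "ROTATE" and open_txn:
            # Worker waits for an XID that never arrives;
            # coordinator waits for the worker.
            return index
    return None


# Partial transaction left by the I/O thread, then the rotate, then the
# same transaction retrieved again in full.
partial_then_full = ["BEGIN", "UPDATE", "ROTATE", "BEGIN", "UPDATE", "XID"]
# Same workload with no interrupted transaction.
clean = ["BEGIN", "UPDATE", "XID", "ROTATE", "BEGIN", "UPDATE", "XID"]

print(find_stall(partial_then_full))
print(find_stall(clean))
```

In the first sequence the model stops at the ROTATE because the partial transaction is still open, which matches the "waits for all workers to finish" behavior quoted above.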

SUGGESTION

Run FLUSH BINARY LOGS; on the Master

See if the movement triggers a response from the SQL threads.

If it does not, go ahead and remove slave_parallel_workers from my.cnf and restart mysql.
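If you prefer to script the my.cnf change, a minimal sketch follows. It assumes the option sits on its own line, and strip_parallel_workers is a hypothetical helper name; it works on text only, so reading and writing the real file (and restarting mysql) is left to you.

```python
import re

# Option files accept both underscores and dashes in option names.
_OPTION = re.compile(r"^\s*slave[-_]parallel[-_]workers\s*(=.*)?$")


def strip_parallel_workers(cnf_text):
    """Return the config text without any slave_parallel_workers line."""
    kept = [line for line in cnf_text.splitlines() if not _OPTION.match(line)]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Hypothetical my.cnf fragment, for demonstration only.
    sample = (
        "[mysqld]\n"
        "server-id = 2\n"
        "slave_parallel_workers = 4\n"
        "relay-log = relay-bin\n"
    )
    print(strip_parallel_workers(sample), end="")
```

With the line removed, MySQL 5.6 falls back to its default of 0 workers, which means single-threaded apply.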

Since you started MySQL on both master and slave and got error 1236, you are trying to establish replication from an impossible position. In the context of GTIDs and the error message you got, the binary logs needed to fully identify a set of queries within a GTID set no longer exist.

Look back at your SHOW SLAVE STATUS\G

Retrieved_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:4757140-5030085
 Executed_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:1-4783274

From this, the last GTID executed is 7846a847-62c7-11e5-91a6-e06995de432e:4783274

This means that the binary log that has or had 7846a847-62c7-11e5-91a6-e06995de432e:4783275 no longer exists.
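The arithmetic behind that conclusion can be sketched as follows. The helper names are made up, and the sketch assumes a set with a single server UUID whose executed intervals have no gaps, as in the output above.

```python
def parse_gtid_set(gtid_set):
    """Parse 'uuid:a-b:c-d' into (uuid, [(a, b), (c, d)]).

    Handles a single-UUID set only; a lone number such as 'uuid:5'
    is treated as the interval (5, 5).
    """
    uuid, *ranges = gtid_set.strip().split(":")
    intervals = []
    for item in ranges:
        start, _, end = item.partition("-")
        intervals.append((int(start), int(end or start)))
    return uuid, intervals


def next_needed_gtid(executed_gtid_set):
    """Return the GTID right after the highest executed transaction."""
    uuid, intervals = parse_gtid_set(executed_gtid_set)
    last = max(end for _, end in intervals)
    return f"{uuid}:{last + 1}"


executed = "7846a847-62c7-11e5-91a6-e06995de432e:1-4783274"
print(next_needed_gtid(executed))
# prints 7846a847-62c7-11e5-91a6-e06995de432e:4783275
```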

I can see this happening if you stopped replication on the Slave, left it off long enough for the Master to purge binary logs the Slave still needed (via expire_logs_days), and then turned replication back on.

In your particular case, try dumping the binary log mysqld-bin.000141 with mysqlbinlog. If nothing comes out of it, you will have to reload the Slave and set up replication from scratch.