If you are running `LOAD DATA LOCAL INFILE` on a Master every 5 minutes, how does that replicate?
Back on Jan 22, 2012, I wrote an answer to this post: MySql shell command not replicated to slave. In my answer, I explained how `LOAD DATA LOCAL INFILE` gets replicated:

- Master
  - Executes `LOAD DATA LOCAL INFILE`
  - Copies the contents of the entire text file used into the binary logs
  - Appends the `LOAD DATA LOCAL INFILE` SQL command to the latest binary log
- Replication ships all of this from the Master's binary logs to the Slave's relay logs
- Slave
  - Sees the text file in the relay logs
  - Reads all the blocks from multiple relay logs
  - Materializes the text file in the `/tmp` folder
  - Reads the `LOAD DATA LOCAL INFILE` command from the relay log
  - Executes `LOAD DATA LOCAL INFILE` in the SQL Thread
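As a concrete sketch, here is what such a 5-minute job on the Master might look like (the file path and table name are placeholders for illustration):

```sql
-- Hypothetical periodic feed load; path and table are made up.
-- The entire contents of this CSV get copied into the binary log,
-- followed by the LOAD DATA statement itself.
LOAD DATA LOCAL INFILE '/var/feeds/orders.csv'
INTO TABLE orders
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```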
During Steps 1-4 on the Slave, the IO Thread would have to be in the `Reading event from the relay log` state, constantly extracting the CSV needed for the next `LOAD DATA LOCAL INFILE` command. Sometimes this causes `Seconds_Behind_Master` to leap hundreds or even thousands of seconds at a time without warning after staying relatively idle.
Running `STOP SLAVE;` does not help at all. Why does `STOP SLAVE;` hang? Steps 1-4 on the Slave lock the IO Thread until the CSV file has been completely extracted. Even when the extraction has completed, there is still the `LOAD DATA LOCAL INFILE` itself: `STOP SLAVE;` will then block on the SQL Thread running `LOAD DATA LOCAL INFILE`.
In this crazy paradigm, replication lag has to increase steadily. Just look at your `Relay_Log_Space`: it is 5708184440 (about 5.3 GB). There are multiple `LOAD DATA LOCAL INFILE` commands just waiting to execute.
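You can watch both symptoms directly (output fields are from standard slave status; your numbers will differ):

```sql
-- Run on the Slave; watch Seconds_Behind_Master and Relay_Log_Space
SHOW SLAVE STATUS\G

-- Run on the Master; each periodic load makes the binlogs grow
-- by roughly the size of the CSV file
SHOW BINARY LOGS;
```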
Look at how `Seconds_Behind_Master` increases:

- `LOAD DATA LOCAL INFILE` is executed on the Master
- The CSV needed is loaded into the binary logs
- The CSV needed is unloaded from the relay logs
- `LOAD DATA LOCAL INFILE` is executed on the Slave
If `LOAD DATA LOCAL INFILE` takes 2 minutes, double that number and add the time taken to ship the CSV file through the MySQL replication process. You may need to come up with a different method of loading data that does not use `LOAD DATA LOCAL INFILE`.
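The arithmetic above works out to a rough lower bound per file; as a sketch (the 120-second load and 60-second ship times are placeholder numbers, not measurements):

```shell
# Rough lower bound on per-file replication lag, assuming the slave's
# load takes as long as the master's (placeholder numbers).
load_seconds=120   # LOAD DATA LOCAL INFILE duration on the Master
ship_seconds=60    # time to ship the CSV binlog -> relay log
echo $(( load_seconds * 2 + ship_seconds ))
```

With a load every 5 minutes (300 seconds), that estimate already consumes the whole window, which is why the lag never recovers.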
UPDATE 2013-04-05 15:00 EDT
If MySQL Replication continues falling behind (`Seconds_Behind_Master` keeps increasing) while the Master keeps getting log-jammed with small `LOAD DATA LOCAL INFILE` commands, there is only one more thing I could suggest: most people never touch `sync_binlog`, which is normally zero. What effect can this have on replicating `LOAD DATA LOCAL INFILE`?
According to the documentation on `sync_binlog`, this setting can be used to flush binlog changes to disk. Since it is 0 by default, your Master DB server is at the mercy of the OS, because the OS is responsible for flushing binlog changes. When you set `sync_binlog` to 1, everything may actually change for the better. How?
Here is what is probably happening when `sync_binlog` is 0 on a Master:

- You run `LOAD DATA LOCAL INFILE`
- mysqld on the Master writes the command to the binlog
- mysqld on the Master writes the entire CSV file into the binlogs
- mysqld on the Master leaves it to the OS to flush binlog changes
- The Slave reads all binlog info from the Master except the last binlog that the Master's OS did not flush
- Slave status shows it is trying to retrieve the remaining info
Here is how `sync_binlog` can hopefully improve things:

- You run `SET GLOBAL sync_binlog = 1;`
- You run `LOAD DATA LOCAL INFILE`
- You run `SET GLOBAL sync_binlog = 0;`
- mysqld on the Master writes the command to the binlog
- mysqld on the Master writes the entire CSV file into the binlogs
- mysqld on the Master flushes every write to the binlogs because `sync_binlog = 1`
- The Slave reads all binlog info from the Master
- Slave status should show it has read every needed binlog
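If you only want the forced flushing during the load itself, the toggling described above would look like this around each job (the file path and table name are placeholders):

```sql
SET GLOBAL sync_binlog = 1;  -- force binlog flushes during the load
LOAD DATA LOCAL INFILE '/var/feeds/orders.csv' INTO TABLE orders;
SET GLOBAL sync_binlog = 0;  -- back to OS-managed flushing
```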
Give it a Try !!!
UPDATE 2013-04-09 11:23 EDT
If you have a low-to-moderate amount of writes (INSERT, UPDATE, DELETE, and ALTER TABLE) on the Master DB server, leaving `sync_binlog` at 1 may not be a bad idea. You would then need to do the following:

STEP 01) On the Slave, run `STOP SLAVE;`

STEP 02) On the Master, add `sync_binlog=1` to `/etc/my.cnf`:

[mysqld]
sync_binlog=1

STEP 03) Run one of the following on the Master:

SET GLOBAL sync_binlog = 1;

or

service mysql restart

STEP 04) On the Slave, run `START SLAVE;`
Give it a Try !!!
Munin's formula for "device utilization" is (milliseconds spent doing I/O)/second, which assumes you can't do any I/O in parallel, so I'm not sure that this is a meaningful metric. However, you do clearly have a genuine performance issue here since replication can't stay caught up.
Is there a significant difference in the I/O subsystems between the servers?
The way I would approach this is to compare the I/O subsystems and then if there isn't a glaring difference in architecture, run some I/O benchmarking at a quiet time with a tool like bonnie++ or fio to narrow down the difference in performance.
Note that the slave likely has an equal write workload to the master because it has to replay all the writes, plus it may have a considerable read workload (for reading any parts of the database that have to be updated) and might be less efficient for reads than the master because it has less RAM for caching. In this situation I would not necessarily expect the slave to be able to catch up, because it's entirely possible for the master to generate work at a rate faster than the slave can consume it.
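As a sketch of the benchmarking step, the same short fio random-write run on each box at a quiet time would expose a raw I/O gap (the job name, block size, and directory are illustrative; pick values that match your workload and disks):

```shell
# Run identically on master and slave; requires the fio package.
fio --name=replication-io-test --rw=randwrite --bs=16k \
    --size=1g --runtime=60 --time_based --directory=/var/tmp
```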
Best Answer
TL;DR: This is probably caused by poor table design combined with ROW-based replication.
I just ran into this problem. I was asked to move an old database to a new server and set up replication.
I found that it's not actually the statement in the subject that causes the slave to hang (SET @@SESSION.GTID_NEXT= 'ANONYMOUS'). This statement is issued at the beginning of a transaction.
This table has 66 million rows. I found that it has no primary key or unique key. The query responsible for this uses an index scan on the master.
For the slave to replicate this with ROW-based replication, it needs to perform approximately 1200 full table scans on the slave. 1200 is probably a fairly small number here. It could be in the hundreds of thousands. The replication does actually work, but with this design, 'seconds_behind_master' will grow indefinitely.
I will add a primary key and partitioning to this table. I will also ask my colleagues to rewrite their code so bulk deletes are no longer necessary. This probably requires adding an additional column.
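A minimal sketch of the first fix, assuming an InnoDB table with no usable natural key (the table and column names here are made up):

```sql
-- Surrogate key so row-based replication can locate each row by
-- primary key instead of scanning; on 66M rows this ALTER will
-- take a while and should be scheduled in a maintenance window.
ALTER TABLE big_table
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
```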
EDIT: I don't have enough points to comment on other posts, so I will add my comments here for now. I believe that issuing 'SET GLOBAL sql_slave_skip_counter = 1', as mentioned by others, will skip the entire transaction and lead to data inconsistencies. Correct me if I'm wrong.
A quick fix would be to change the binlog format to STATEMENT or MIXED. These formats can also lead to data inconsistencies, so I would recommend finding and fixing the root cause instead of changing the binlog format.
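For reference, that quick (risky) workaround would look like this; the change only affects new connections, and fixing the missing key remains the better option:

```sql
SHOW VARIABLES LIKE 'binlog_format';   -- likely ROW today
SET GLOBAL binlog_format = 'MIXED';    -- applies to new sessions only
```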