Thesql 5.6 gtid replication slave stuck (system lock)

gtidMySQLreplication

I have set up 5.6 gtid based replication (on 5.6.26) it seemed to work when I did it, it replicated my random test db over that I created beside normal data.
However at some point something must have happened because all I see is this:

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: System lock
                  Master_Host: xxxxxxxxxxxxxxxxxx
                  Master_User: xxxxxxxxxxxxxxxx
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000141
          Read_Master_Log_Pos: 169293671
               Relay_Log_File: mysqld-relay-bin.000003
                Relay_Log_Pos: 16861206
        Relay_Master_Log_File: mysqld-bin.000141
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 16860994
              Relay_Log_Space: 169298584
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 55203
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 7846a847-62c7-11e5-91a6-e06995de432e
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: System lock
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:4757140-5030085
            Executed_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:1-4783274
                Auto_Position: 1

now originally "Slave_SQL_Running_State" said "reading event from relay log" or something like that, it has now changed to system lock as well (IO state always said that).

It seems the Seconds_Behind_Master is increasing steadily, and the relay log grows rapidly in size on the filesystem, while Executed_gtid_set does seem to change, but still something seems wrong because it is just so much behind….

Here's the processlist:

mysql> show processlist;
+------+-------------+-----------+------+---------+-------+---------------------------------------+------------------+
| Id   | User        | Host      | db   | Command | Time  | State                                 | Info             |
+------+-------------+-----------+------+---------+-------+---------------------------------------+------------------+
| 1877 | root        | localhost | NULL | Sleep   |  6076 |                                       | NULL             |
| 1878 | root        | localhost | NULL | Query   |     0 | init                                  | show processlist |
| 1886 | system user |           | NULL | Connect |   783 | System lock                           | NULL             |
| 1887 | system user |           | NULL | Connect |     0 | System lock                           | NULL             |
| 1888 | system user |           | NULL | Connect |   783 | Waiting for an event from Coordinator | NULL             |
| 1889 | system user |           | NULL | Connect | 55455 | System lock                           | NULL             |
+------+-------------+-----------+------+---------+-------+---------------------------------------+------------------+

I tried to stop the slave and start it again but it didnt help.

Does anybody have any ideas what I could try to make this work again? Would be much appreciated.

Thanks

Best Answer

Since I see more than 2 system user entries in the processlist, I would assume you are using Multi-Threaded Replication (slave_parallel_workers > 1).

That looks like a bug

On Oct 29, 2014, this was expressed by David Moss

Thank you for your feedback. This issue was covered in bug 17326020 and the following was added to the MySQL 5.6.21 and 5.7.5 changelogs:

When the I/O thread reconnected to a master using GTIDs and multithreaded slaves while in the middle of a transaction, it failed to abort the transaction, leaving a partial transaction in the relay log, and then retrieving the same transaction again. This occurred when performing a rotation of the relay log. Now when reconnecting, the server checks before rotating the log in such cases, and waits first for any ongoing transaction to complete.

Therefore nothing new will be added to cover this bug and I'm closing it as fixed.

On Dec 10, 2014, this was expressed by Laurynas Biveinis

Problem:

With MTS, GTIDs and auto positioning enabled, when a worker applies a partial transaction left on relaylog by an IO thread reconnection, it will wait for the XID log event to commit the transaction.

Unfortunately, the SQL thread coordinator will reach the master's ROTATE event on the next relaylog file and will wait for all workers to finish their tasks before applying the ROTATE.

Analysis:

As the whole transaction is retrieved again by the IO thread after the reconnection, the slave must rollback the partial transaction once noticing this ROTATE from the master.

This bug reports the same issue already fixed by BUG#17326020, and the reported issue is not reproducible anymore. So, this patch is just adding a new test case.

SUGGESTION

Run FLUSH BINARY LOGS; on the Master

See if the movement triggers a response from the SQL threads.

If it does not, go ahead and remove slave_parallel_workers from my.cnf and restart mysql.

Since you started MySQL up and master and slave and got error 1236, that means you are trying to establish replication from an impossible position. In the context of GTID and error message you got, the binary logs needed to fully identify a set of queries within a GTID set no longer exists,

Look back at your SHOW SLAVE STATUS\G

Retrieved_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:4757140-5030085
 Executed_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:1-4783274

From this, the last GTID executed is 7846a847-62c7-11e5-91a6-e06995de432e:4783274

This means that the binary log that has or had 7846a847-62c7-11e5-91a6-e06995de432e:4783275 no longer exists.

I can see this happening if you stopped replication on the Slave, left replication off long enough for the Master to rotate its binary logs (via expire_logs_days) the slave still needed to see, then turned on replication.

In your particular case, try doing a mysqlbinlog dump of the binary log mysqld-bin.000141. If nothing comes out of it, you will have to reload the Slave and setup replication from scratch.