MySQL – Replication Hung – Seconds_Behind_Master Increasing

Tags: master-slave-replication, MySQL, mysql-5.6, replication

One of my slaves is no longer replicating. Seconds_Behind_Master continues to increase, Exec_Master_Log_Pos does not increase, and Relay_Log_Space does increase. Slave_IO_Running and Slave_SQL_Running are both Yes (unless I stop the slave or it hits the 1205 error).

I've tried the solutions in a thread that sounded similar, Slave SQL thread got hanged, but haven't had any luck. I also tried a RESET SLAVE, which still produces the same behavior.

Additionally, when I run:

stop slave;

on this instance it takes 30+ seconds to execute:

Query OK, 0 rows affected (33.97 sec)

show slave status\G

returns:

               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.40.203
                  Master_User: replicant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000779
          Read_Master_Log_Pos: 881930813
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 283
        Relay_Master_Log_File: mysql-bin.000779
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: Users,corporations,dates,systemspecs,test_replication,domains,test,ips,deleteddate,percona_checksum,accesslevels,status,collectionsdata,orders,email_to_user,requests,userprops,percona_checksum,useremails,requests_site,sections,ordertosection,UserToGroup,validkeys
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: percona.%
  Replicate_Wild_Ignore_Table: test.%
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 771399898
              Relay_Log_Space: 110531372
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 4784
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2222
                  Master_UUID: example
             Master_Info_File: /mnt/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: updating
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0

I have four other slaves of the same master that are all functional, so I know the master's binary logs aren't corrupt.

If I leave replication running, I eventually end up with a 1205 error:

Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
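For reference, the retry and lock-wait settings involved can be checked (and the retry count raised) at runtime. This is just a sketch; the variable names are the standard MySQL 5.6 ones:

    SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';  -- default is 10
    SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';   -- default is 50 (seconds)
    -- Raising the retry count only buys time; it does not remove whatever is holding the lock:
    -- SET GLOBAL slave_transaction_retries = 64;  -- example value only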

Update 1:

Running SHOW PROCESSLIST brought back:

348 | replicant  | serverDNS | NULL  | Binlog Dump | 1107340 | Master has sent all binlog to slave; waiting for binlog to be updated

After finding this, we raised innodb_lock_wait_timeout from its default of 50 to 14400. This allowed replication to proceed again. However, it is unclear why the 50-second timeout would be hit on only one of the five slaves. All slaves are m5.2xlarge AWS instances, so they have the same resources.
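While the SQL thread is sitting in the lock wait, the InnoDB tables in information_schema (available in 5.6) can show which transaction is blocking it. A sketch, which has to be run on the stalled slave while the wait is in progress:

    SELECT r.trx_mysql_thread_id AS waiting_thread,
           r.trx_query           AS waiting_query,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_started         AS blocking_since,
           b.trx_query           AS blocking_query
    FROM   information_schema.innodb_lock_waits w
    JOIN   information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
    JOIN   information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;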

Additionally, should I stop at 14400, or should I just set this to the maximum value of 1073741824?

Update 2:

If I restart the mysql service, replication proceeds as expected for about a day, and then the issue reproduces.

Additionally, this slave is itself a master of another system, if that makes a difference. Its own slave is replicating fine.

The currently relevant (in my eyes) SHOW SLAVE STATUS lines:

        Master_Log_File: mysql-bin.000786
    Read_Master_Log_Pos: 131895019
         Relay_Log_File: mysqld-relay-bin.000025
          Relay_Log_Pos: 52668949
  Relay_Master_Log_File: mysql-bin.000786
    Exec_Master_Log_Pos: 91692081
        Relay_Log_Space: 131895472
  Seconds_Behind_Master: 12163

91692081 is the Exec_Master_Log_Pos value it is currently stuck at.
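The event it is stuck on can be inspected directly from the relay log on this slave, using the Relay_Log_File/Relay_Log_Pos values above. A sketch; with row-based events this shows event types and tables rather than the original SQL:

    SHOW RELAYLOG EVENTS
        IN 'mysqld-relay-bin.000025'
        FROM 52668949
        LIMIT 10;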

Update 3:

Looking into it further, OS file reads, OS file writes, and OS fsyncs are consistently increasing. I have also found a warning being logged:

Warning: difficult to find free blocks in the buffer pool (324 search iterations)! 0 failed attempts to flush a page! Consider increasing the buffer pool size. It is also possible that in your Unix version fsync is very slow, or completely frozen inside the OS kernel. Then upgrading to a newer version of your operating system may help. Look at the number of fsyncs in diagnostic info below.
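To follow up on that warning, a quick look at buffer pool sizing and pressure. A sketch using standard 5.6 variable and status names; note that in 5.6 innodb_buffer_pool_size cannot be changed at runtime, only via my.cnf and a restart:

    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';
    -- SHOW ENGINE INNODB STATUS includes the fsync counts the warning refers to:
    SHOW ENGINE INNODB STATUS\G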

Best Answer

Since your other slaves are running fine, this is possibly caused by user error. Some row was probably changed manually on this slave when it was meant to be changed on the master. The slave then hits something like a duplicate key constraint violation 10 times and gives up. If the transaction causing the error is slow, or it affects a lot of rows combined with row-based replication, it can take a long time before replication fails. See MySQL replication slave hangs after encountering SET @@SESSION.GTID_NEXT= 'ANONYMOUS';

Try issuing SHOW BINLOG EVENTS IN 'mysql-bin.000779' FROM 771399898 LIMIT 500; on the master to find the offending query.

These values are from your SHOW SLAVE STATUS output. I put the LIMIT at 500 as a starting point, but increase it if it doesn't give you enough data.
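If a local write on this slave does turn out to be the cause, one way to prevent a repeat is to make the slave read-only. A sketch; accounts with the SUPER privilege (and the replication SQL thread itself) are not affected, and 5.6 has no super_read_only:

    SET GLOBAL read_only = ON;
    -- To keep it across restarts, also add read_only = 1 to the [mysqld] section of my.cnf.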

It may be faster to set up replication again (using a non-locking tool such as innobackupex from Percona XtraBackup) than to track down the root cause of this problem and its fix.
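For example, after restoring a backup taken with innobackupex, the binlog coordinates it records in xtrabackup_binlog_info are used to repoint the slave. A sketch with placeholder values:

    -- On the rebuilt slave, after restoring the backup.
    -- File/position below are placeholders; use the values from xtrabackup_binlog_info.
    CHANGE MASTER TO
        MASTER_HOST     = '10.0.40.203',
        MASTER_USER     = 'replicant',
        MASTER_PASSWORD = '...',               -- placeholder
        MASTER_LOG_FILE = 'mysql-bin.000790',  -- placeholder
        MASTER_LOG_POS  = 4;                   -- placeholder
    START SLAVE;
    SHOW SLAVE STATUS\G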