MySQL – Replication Hung – Seconds_Behind_Master Increasing

Tags: master-slave-replication, MySQL, mysql-5.6, replication

One of my slaves is no longer replicating. Seconds_Behind_Master continues to increase, Exec_Master_Log_Pos does not increase, and Relay_Log_Space does increase. Slave_IO_Running and Slave_SQL_Running are both Yes (unless I stop the slave or it hits the 1205 error).

I've tried the solutions in a thread that sounded similar, Slave SQL thread got hanged, but haven't had any luck. I also tried a RESET SLAVE, which still produces the same behavior.

Additionally, when I run:

stop slave;

on this instance it takes 30+ seconds to execute:

Query OK, 0 rows affected (33.97 sec)

show slave status\G

returns:

               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.40.203
                  Master_User: replicant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000779
          Read_Master_Log_Pos: 881930813
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 283
        Relay_Master_Log_File: mysql-bin.000779
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: Users,corporations,dates,systemspecs,test_replication,domains,test,ips,deleteddate,percona_checksum,accesslevels,status,collectionsdata,orders,email_to_user,requests,userprops,percona_checksum,useremails,requests_site,sections,ordertosection,UserToGroup,validkeys
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: percona.%
  Replicate_Wild_Ignore_Table: test.%
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 771399898
              Relay_Log_Space: 110531372
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 4784
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2222
                  Master_UUID: example
             Master_Info_File: /mnt/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: updating
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0

I have four other slaves of the same master that are all functional, so I know the master's binary logs aren't corrupt.

If I leave replication running, I eventually end up with a 1205 error:

Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
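For reference, the retry and lock-wait settings involved can be checked (and the retry count raised) at runtime. This is just a sketch; the variable names are the standard MySQL 5.6 ones:

    SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';  -- default is 10
    SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';   -- default is 50 (seconds)
    -- Raising the retry count only buys time; it does not remove whatever is holding the lock:
    -- SET GLOBAL slave_transaction_retries = 64;  -- example value only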

Update 1:

Running SHOW PROCESSLIST brought back:

348 | replicant  | serverDNS | NULL  | Binlog Dump | 1107340 | Master has sent all binlog to slave; waiting for binlog to be updated

After finding this, we raised innodb_lock_wait_timeout from its default of 50 to 14400. This allowed replication to proceed again. However, it is unclear why the 50-second timeout would be hit on only one of the five slaves. All slaves are m5.2xlarge AWS instances, so they have the same resources.
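While the SQL thread is sitting in the lock wait, the InnoDB tables in information_schema (available in 5.6) can show which transaction is blocking it. A sketch, which has to be run on the stalled slave while the wait is in progress:

    SELECT r.trx_mysql_thread_id AS waiting_thread,
           r.trx_query           AS waiting_query,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_started         AS blocking_since,
           b.trx_query           AS blocking_query
    FROM   information_schema.innodb_lock_waits w
    JOIN   information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
    JOIN   information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;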

Additionally, should I stop at 14400, or should I just set this to the maximum value of 1073741824?

Update 2:

If I restart the mysql service, replication proceeds as expected for about a day, and then the issue reproduces.

Additionally, this slave is itself a master of another system, if that makes a difference. Its own slave is replicating fine.

The currently relevant (in my eyes) SHOW SLAVE STATUS lines:

        Master_Log_File: mysql-bin.000786
    Read_Master_Log_Pos: 131895019
         Relay_Log_File: mysqld-relay-bin.000025
          Relay_Log_Pos: 52668949
  Relay_Master_Log_File: mysql-bin.000786
    Exec_Master_Log_Pos: 91692081
        Relay_Log_Space: 131895472
  Seconds_Behind_Master: 12163

91692081 is the Exec_Master_Log_Pos value it is currently stuck at.
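The event it is stuck on can be inspected directly from the relay log on this slave, using the Relay_Log_File/Relay_Log_Pos values above. A sketch; with row-based events this shows event types and tables rather than the original SQL:

    SHOW RELAYLOG EVENTS
        IN 'mysqld-relay-bin.000025'
        FROM 52668949
        LIMIT 10;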

Update 3:

Looking into it further, OS file reads, OS file writes, and OS fsyncs are consistently increasing. I have also found a warning being logged:

Warning: difficult to find free blocks in the buffer pool (324 search iterations)! 0 failed attempts to flush a page! Consider increasing the buffer pool size. It is also possible that in your Unix version fsync is very slow, or completely frozen inside the OS kernel. Then upgrading to a newer version of your operating system may help. Look at the number of fsyncs in diagnostic info below.
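To follow up on that warning, a quick look at buffer pool sizing and pressure. A sketch using standard 5.6 variable and status names; note that in 5.6 innodb_buffer_pool_size cannot be changed at runtime, only via my.cnf and a restart:

    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';
    -- SHOW ENGINE INNODB STATUS includes the fsync counts the warning refers to:
    SHOW ENGINE INNODB STATUS\G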

Best Answer

Since your other slaves are running fine, this is possibly caused by user error. Some row was probably changed manually on this slave when it was meant to be changed on the master. The slave then hits something like a duplicate key constraint violation 10 times and gives up. If the transaction causing the error is slow, or it affects a lot of rows combined with row-based replication, it can take a long time before replication fails. See MySQL replication slave hangs after encountering SET @@SESSION.GTID_NEXT= 'ANONYMOUS';

Try issuing SHOW BINLOG EVENTS IN 'mysql-bin.000779' FROM 771399898 LIMIT 500; on the master to find the offending query.

These values are from your SHOW SLAVE STATUS output. I put the LIMIT at 500 as a starting point, but increase it if it doesn't give you enough data.
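If a local write on this slave does turn out to be the cause, one way to prevent a repeat is to make the slave read-only. A sketch; accounts with the SUPER privilege (and the replication SQL thread itself) are not affected, and 5.6 has no super_read_only:

    SET GLOBAL read_only = ON;
    -- To keep it across restarts, also add read_only = 1 to the [mysqld] section of my.cnf.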

It may be faster to set up replication again (using a non-locking tool such as innobackupex from Percona XtraBackup) than to track down the root cause of this problem and its fix.
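For example, after restoring a backup taken with innobackupex, the binlog coordinates it records in xtrabackup_binlog_info are used to repoint the slave. A sketch with placeholder values:

    -- On the rebuilt slave, after restoring the backup.
    -- File/position below are placeholders; use the values from xtrabackup_binlog_info.
    CHANGE MASTER TO
        MASTER_HOST     = '10.0.40.203',
        MASTER_USER     = 'replicant',
        MASTER_PASSWORD = '...',               -- placeholder
        MASTER_LOG_FILE = 'mysql-bin.000790',  -- placeholder
        MASTER_LOG_POS  = 4;                   -- placeholder
    START SLAVE;
    SHOW SLAVE STATUS\G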