One of my slaves is no longer replicating. Seconds_Behind_Master continues to increase, Exec_Master_Log_Pos does not advance, and Relay_Log_Space does increase. Slave_IO_Running and Slave_SQL_Running are both Yes (unless I stop the slave, or it encounters the 1205 error described below).
I've tried the solutions in a thread that sounded similar, Slave SQL thread got hanged, but haven't had any luck. I also tried RESET SLAVE, which still produces the same behavior.
Additionally, when I run:
stop slave;
on my instance, it takes over 30 seconds to execute:
Query OK, 0 rows affected (33.97 sec)
show slave status\G
returns:
Slave_IO_State: Waiting for master to send event
Master_Host: 10.0.40.203
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000779
Read_Master_Log_Pos: 881930813
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000779
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: test
Replicate_Do_Table: Users,corporations,dates,systemspecs,test_replication,domains,test,ips,deleteddate,percona_checksum,accesslevels,status,collectionsdata,orders,email_to_user,requests,userprops,percona_checksum,useremails,requests_site,sections,ordertosection,UserToGroup,validkeys
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: percona.%
Replicate_Wild_Ignore_Table: test.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 771399898
Relay_Log_Space: 110531372
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 4784
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2222
Master_UUID: example
Master_Info_File: /mnt/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: updating
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
I have four other slaves of the same master that are all functional, so I know the master logs aren't corrupt.
If I leave replication running, I eventually end up with a 1205 error:
Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
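For reference, the retry limit named in that error is the dynamic slave_transaction_retries variable, which can be inspected and raised at runtime. This is only a sketch: raising it buys more retries, but does not remove whatever is holding the locks.

-- Current retry limit for the SQL thread (the default is 10)
SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';

-- Allow more retries before the SQL thread gives up with error 1205
SET GLOBAL slave_transaction_retries = 50;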
UPDATE:
Running SHOW PROCESSLIST brought back:
348 | replicant | serverDNS | NULL | Binlog Dump | 1107340 | Master has sent all binlog to slave; waiting for binlog to be updated
After finding this, we altered innodb_lock_wait_timeout from its default value of 50 to 14400. This allowed replication to proceed again. However, it is unclear why the 50-second timeout would occur on only one of five slaves. All slaves are m5.2xlarge AWS instances, so they have the same resources. Additionally, should I stop at 14400, or should I just set this to the maximum of 1073741824?
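For the record, the change we made looked like the following. This is a sketch assuming MySQL 5.6/5.7 (the INFORMATION_SCHEMA lock tables moved to performance_schema in 8.0); note that SET GLOBAL only affects sessions started afterwards, so the SQL thread must be restarted to pick up the new value.

-- Default is 50 seconds
SET GLOBAL innodb_lock_wait_timeout = 14400;
STOP SLAVE SQL_THREAD;
START SLAVE SQL_THREAD;

-- Rather than raising the timeout further, it may be more useful to see
-- which transaction on this slave is actually holding the lock
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.innodb_trx
WHERE trx_state = 'LOCK WAIT';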
Update 2:
If I restart the mysql service, replication proceeds as expected for about a day, then the issue reproduces.
Additionally, this slave is itself the master of another system, if that makes a difference. Its downstream slave is running fine.
The currently relevant (in my eyes) slave status lines:
Master_Log_File: mysql-bin.000786
Read_Master_Log_Pos: 131895019
Relay_Log_File: mysqld-relay-bin.000025
Relay_Log_Pos: 52668949
Relay_Master_Log_File: mysql-bin.000786
Exec_Master_Log_Pos: 91692081
Relay_Log_Space: 131895472
Seconds_Behind_Master: 12163
The Exec_Master_Log_Pos value of 91692081 is where it is currently stuck.
Update 3:
Looking into it further, OS file reads, OS file writes, and OS fsyncs are consistently increasing. I have also found a warning being logged:
Warning: difficult to find free blocks in the buffer pool (324 search iterations)! 0 failed attempts to flush a page! Consider increasing the buffer pool size. It is also possible that in your Unix version fsync is very slow, or completely frozen inside the OS kernel. Then upgrading to a newer version of your operating system may help. Look at the number of fsyncs in diagnostic info below.
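Those counters match the FILE I/O section of the InnoDB monitor output; a sketch of how to sample them, along with the buffer pool size the warning points at (the variable names below are the standard InnoDB counters):

-- The FILE I/O section reports "OS file reads, OS file writes, OS fsyncs"
SHOW ENGINE INNODB STATUS\G

-- The same counters as status variables, easier to sample over time
SHOW GLOBAL STATUS LIKE 'Innodb_data_reads';
SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';
SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';

-- The warning suggests checking the buffer pool size as well
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';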
Best Answer
Since your other slaves are running fine, this is possibly caused by user error. Some row was probably changed manually on this slave that was meant to be changed on the master. The slave then encounters something like a duplicate key constraint violation 10 times and fails. If the transaction causing this error is slow, or it affects a lot of rows combined with ROW-based replication, it can take a long time before replication fails. See MySQL replication slave hangs after encountering SET @@SESSION.GTID_NEXT= 'ANONYMOUS';.
Try to issue
SHOW BINLOG EVENTS IN 'mysql-bin.000779' FROM 771399898 LIMIT 500;
to find the offending query. These values are from your SHOW SLAVE STATUS output. I put the LIMIT at 500 as a starting point, but increase it if it doesn't give you enough data.
It may be faster to set up the replication again (using a non-locking tool like Innobackupex) than to troubleshoot the root cause and solution of this problem.
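If manual writes on the slave do turn out to be the cause, one common guard (an assumption about your setup; it does not block users with the SUPER privilege, and it does not affect the replication threads) is to make the slave read-only:

-- Reject writes from ordinary user sessions; replication still applies events
SET GLOBAL read_only = ON;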