MySQL – Troubleshooting Slave Timeouts and Seconds_Behind_Master Jumping

mariadbMySQLreplication

I have 2 slaves (10.1.34-MariaDB) which have been running smoothly for months.
Suddenly, my lightsail/ec2 slave is behaving strangely.

  • While SHOW SLAVE STATUS\G shows both IO/SQL are running, Seconds_Behind_Master is jumping between 0 and >10000
  • On STOP SLAVE; START SLAVE; the Seconds_Behind_Master value goes to >10000 (this is correct) but after 10-20s jumps to 0 and and remains there.
  • While Seconds_Behind_Master = 0, the Relay_Log_Pos, Exec_Master_Log_Pos, Relay_Log_Space all stand still, but tick over, very slowly (master is writing about >1000 qps)
  • It will, after a few minutes, continue as normal for another 10-20s before repeating, this is after an error in the log:

    Oct 31 09:49:47 ip-xx mysqld[5682]: 2018-10-31 9:49:47 140653101021952 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)

    Oct 31 09:49:47 ip-1xx mysqld[5682]: 2018-10-31 9:49:47 140653101021952 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binlog.002619' at position 11971257; GTID position '1-102-639556882,0-0-36714

My other slave is behaving normally. The ping between master/slave is fine. Using Ubuntu 18.04.1 ufw with 22/3306 open.

I have already applied the solution supplied in one answer to the question MySQL 5.6: Slave_IO thread stops working by setting MASTER_HEARTBEAT_PERIOD = 1, but the master is always busy, so this is not helping.

Any ideas?

Output of Show Slave Status

 show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: x.x.x.x
                  Master_User: slave_user
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: binlog.002619
          Read_Master_Log_Pos: 19098938
               Relay_Log_File: relay-log.000006
                Relay_Log_Pos: 1283648
        Relay_Master_Log_File: binlog.002619
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: db
          Replicate_Ignore_DB: 
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 13254037
              Relay_Log_Space: 15338312
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File: /etc/mysql/ssl/ca-cert.pem
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 101
               Master_SSL_Crl: /etc/mysql/ssl/ca-cert.pem
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-102-639578825,0-0-367142125,2-1-3159
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
1 row in set (0.00 sec)

Related conf (not all)

[mysqld]
bind-address=0.0.0.0
log_bin                 = /var/log/mysql/mariadb-bin
log_bin_index           = /var/log/mysql/mariadb-bin.index
log_slave_updates=1
expire_logs_days        = 5
max_binlog_size         = 100M
server_id=1002
binlog_format=ROW
slave_net_timeout=10

Space according to df

Filesystem     1K-blocks      Used Available Use% Mounted on
udev             1007832         0   1007832   0% /dev
tmpfs             203976      1344    202632   1% /run
/dev/xvda1             x         x        x    15% /
tmpfs            1019864         0   1019864   0% /dev/shm
tmpfs               5120         0      5120   0% /run/lock
tmpfs            1019864         0   1019864   0% /sys/fs/cgroup
/dev/loop0         13056     13056         0 100% /snap/amazon-ssm-agent/495
/dev/loop1         90112     90112         0 100% /snap/core/5328
/dev/loop2         16896     16896         0 100% /snap/amazon-ssm-agent/784
/dev/loop3         89984     89984         0 100% /snap/core/5662
/dev/loop4         16768     16768         0 100% /snap/amazon-ssm-agent/734
/dev/xvdf              x         x         x  80% /mnt/block1
/dev/loop5         89984     89984         0 100% /snap/core/5742
tmpfs             203972         0    203972   0% /run/user/1000

Best Answer

I've seen that as far back as 4.0. I tried to chase it down but failed. It always went away.

Bottom line: "Not a problem".

If you can create a reproducible test case, file a bug with bugs.mysql.com