Mysql – Seconds_Behind_Master high even though Read_Master_Log_Pos and Exec_Master_Log_Pos are very close

master-slave-replicationMySQL

From what I understand from https://dev.mysql.com/doc/refman/5.7/en/show-slave-status.html, seconds_behind_master is the difference in the timestamp of where the IO thread is currently writing in the slave and where the SQL thread is executing now. And the difference between Read_Master_Log_Pos and Exec_Master_Log_Pos should co-relate to the value of seconds behind master. But that is not the case here.

My DB had a significant slave lag, and I was trying to see if it was because of IO thread or the the SQL thread.

This is the result of show slave status.

         Slave_IO_State: Waiting for master to send event
                Master_Host: 10.85.162.164
                Master_User: replication
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: 1085162164-mysql-binary.001952
        Read_Master_Log_Pos: 148220048
             Relay_Log_File: mysqld-relay-bin.002127
              Relay_Log_Pos: 148035459
      Relay_Master_Log_File: 1085162164-mysql-binary.001952
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB: 
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: ____
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: ____
                 Last_Errno: 0
                 Last_Error: 
               Skip_Counter: 0
        Exec_Master_Log_Pos: 148035282
            Relay_Log_Space: 148220460
            Until_Condition: None
             Until_Log_File: 
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File: 
         Master_SSL_CA_Path: 
            Master_SSL_Cert: 
          Master_SSL_Cipher: 
             Master_SSL_Key: 
      Seconds_Behind_Master: 540
Master_SSL_Verify_Server_Cert: No
              Last_IO_Errno: 0
              Last_IO_Error: 
             Last_SQL_Errno: 0
             Last_SQL_Error: 
Replicate_Ignore_Server_Ids: 
           Master_Server_Id: 1085162164
                Master_UUID: 288abc1b-803e-11e7-9c70-02010a55a2a4
           Master_Info_File: /var/lib/mysql/master.info
                  SQL_Delay: 0
        SQL_Remaining_Delay: NULL
    Slave_SQL_Running_State: freeing items
         Master_Retry_Count: 86400
                Master_Bind: 
    Last_IO_Error_Timestamp: 
   Last_SQL_Error_Timestamp: 
             Master_SSL_Crl: 
         Master_SSL_Crlpath: 
         Retrieved_Gtid_Set: 
          Executed_Gtid_Set: 
              Auto_Position: 0

This is the status on the master.

| File                           | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
| 1085162164-mysql-binary.001952 | 776669711 |              |                  |                   | 

The master_log_file and relay_master_log_file are in the same file, which is also the same file that the master is writing into. Here the difference between Read_Master_Log_Pos and Exec_Master_Log_Pos is also very small around 180000.

But there is a large difference to the position in the position of the master, which would suggest that the IO thread is the one that is lagging.

From the docs, seconds_behind_master's description contained the following statement.

In essence, this field measures the time difference in seconds between the slave SQL thread and the slave I/O thread.

Which should be very low in this case. But it is pretty high at 540 seconds.

Also, the value of seconds behind master is oscillating in values from 500 -> 0 -> 501 -> 0 -> 501 -> 0 -> 502 -> 0. What could cause this? One reason according to https://www.percona.com/blog/2013/09/16/possible-reasons-when-mysql-replication-lag-is-flapping-between-0-and-xxxxx/, might be that the IO link is saturated (#4), but I saw the network usage. It was around 50 MBps while the capacity of the link is 75 MBps. Any other possible reasons that could cause this?

Best Answer

Regarding the seconds_behind_master oscillation. I found that the description in seconds_behind_master in the mysql documentation is not accurate. I checked the source code.

long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp)
                            - mi->clock_diff_with_master);

Seconds behind master is calculated based on the difference between current timestamp on the slave and the timestamp on the statement which is being executed by the SQL thread and any initial difference in the clocks between the slave and the master.
And if the SQL thread is completely caught up with the IO thread (i.e. SQL thread has executed all the statements in the relay log file), it is shown as 0 in mysql5.6 (This was changed to show NULL in this case in 5.7)

In my case, the IO thread was the one that was lagging, I checked the status of the binlog file on the master and relay log file on the slave. The IO thread was waiting for the binlog_dump thread to send data to the slave. For some reason, the IO thread was not fetching/able to fetch all the data from the master(Could be a network related issue. Looking into it). The rate of data growth on the slave's relay log file was slower than the rate of the data growth on the master's binlog even when there was a lot of data that can be fetched. And before the IO thread is able to fetch more data and write to the relay log file, the SQL thread was finishing executing those statements. And seconds_behind_master becomes 0.

Whenever SQL thread is executing these events in the relay log file, it shows a huge lag and once it has completed executing these events, the value falls to 0.