MySQL Replication Lag Behaving Erraticly

MySQLreplication

I've got a vanilla MySQL replication, using "MIXED" mode.

The mysql slave lag calculation behaves very strangely – one moment it is 0 and the one after it it's 3630 seconds (or a similar number), then back to 0, and so on and so forth. Obviously, something is wrong in the replication configuration, since MySQL computes the lag based on the timestamps in the relay log.

I've tried checking the time of the servers, which is identical (using SELECT NOW() in MySQL). I've also checked the timezone (SELECT @@system_time_zone) which is set to CDT on both master and slave.

What else can I verify to make sure this issue is resolved? Did anyone else encounter this problem?

Best Answer

MySQL Replication performs weird calculations for Seconds_Behind_Master based on a few things.

  • What NOW() returns for SQL run on the master as recorded in relay logs
  • What NOW() returns on the slave
  • The last position from the master that was executed

Here is a SHOW SLAVE STATUS\G and how to identify the Last Executed SQL from the Master

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.17.20.102
                  Master_User: replicant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.002814
          Read_Master_Log_Pos: 823078734
               Relay_Log_File: relay-bin.007364
                Relay_Log_Pos: 823078879
        Relay_Master_Log_File: mysql-bin.002814
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 823078734
              Relay_Log_Space: 823079071
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
1 row in set (0.00 sec)

Notice the following fields in SHOW SLAVE STATUS\G

  • Master_Log_File (Line 6) : Log File on the Master whose position was last read
  • Read_Master_Log_Pos (Line 7) : Last position read on the Slave from the Master
  • Relay_Master_Log_File (Line 10) : Log File on the Master whose position was last executed
  • Exec_Master_Log_Pos (Line 22) : Last position executed on the Slave from the Master
  • Relay_Log_Space (Line 23) : Sum of bytes from all relay logs

To calculate Seconds_Behind_Master, it would essentially figured out

NOW() as recorded in Master LogFile 'Master_Log_File' LogPos 'Read_Master_Log_Pos'
minus
NOW() on the Slave

This number can intermittently grow because the I/O thread can collect more entries from the Master and loading it into the last relay log while processing the SQL statement located at Master LogFile 'Relay_Master_Log_File' and Master LogPos 'Exec_Master_Log_Pos'. This increasing number is also manifested by three(3) variables changing: Relay_Log_Space, Master_Log_File, Read_Master_Log_Pos. If Seconds_Behind_Master is 0, this is good indication that Read_Master_Log_Pos and Exec_Master_Log_Pos virtually the same. Also worth noting is the fact that long running SQL can fool you into thinking that replication is behind. Once that SQL statement is complete and no other statements have sufficiently backlogged, Seconds_Behind_Master can drop dramatically, even to 0.

So, why would Seconds_Behind_Master bounce back and forth between 0 and an increasing number. It is possible for network latency in transmitting SQL through the I/O thread to contribute to miscalculation because the needed TIMESTAMP variable was not reached yet in the back of the last relay log (remember this is asynchronous replication). To verify this, run mysqlbinlog against any binary log or relay log and see where the TIMESTAMP variables are written between statements.

This behavior was first addressed June 22, 2007 and the bug was supposedly fixed then.

Here is one sure way to take MySQL Replication and screw its head on correctly (For this example, use the same SHOW SLAVE STATUS\G)

Using Relay_Master_Log_File mysql-bin.002814 and Exec_Master_Log_Pos 823078734 as the Log FilePosition to Restart From, run these commands

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.002814',MASTER_LOG_POS=823078734;
START SLAVE;

These steps should 1) kill both I/O and SQL Threads, 2) wipe out All Collected Relay Logs, 3) start with a New Relay Log, and 4) establish New I/O and SQL Threads.

Replication should be good to go from here. As to whether or not this bug can truly be addressed, this now in Oracle's hands (That last statement does not inspire any confidence, does it ???)