MySQL Replication – Seconds Behind Master Keeps Growing

MySQL, replication

We have set up a new slave server for our production database. Since then, Seconds_Behind_Master has kept growing slowly.

This is the slave output:

           Slave_IO_State: Waiting for master to send event
              Master_Host: notimportant
              Master_User: repl
              Master_Port: 3306
            Connect_Retry: 60
          Master_Log_File: mysql-bin.003790
      Read_Master_Log_Pos: 42585179
           Relay_Log_File: mysqld-relay-bin.002798
            Relay_Log_Pos: 32374374
    Relay_Master_Log_File: mysql-bin.003492
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
          Replicate_Do_DB: 
      Replicate_Ignore_DB: 
       Replicate_Do_Table: 
   Replicate_Ignore_Table: 
  Replicate_Wild_Do_Table:  
               Last_Errno: 0
               Last_Error: 
             Skip_Counter: 0
      Exec_Master_Log_Pos: 32374215
          Relay_Log_Space: 31350956440
          Until_Condition: None
           Until_Log_File: 
            Until_Log_Pos: 0
       Master_SSL_Allowed: No
       Master_SSL_CA_File: 
       Master_SSL_CA_Path: 
          Master_SSL_Cert: 
        Master_SSL_Cipher: 
           Master_SSL_Key: 
    Seconds_Behind_Master: 1448477
            Last_IO_Errno: 0
            Last_IO_Error: 
           Last_SQL_Errno: 0
         Master_Server_Id: 6524
              Master_UUID: 
         Master_Info_File: /var/lib/mysql/master.info
                SQL_Delay: 0
      SQL_Remaining_Delay: NULL
  Slave_SQL_Running_State: Reading event from the relay log
       Master_Retry_Count: 86400
              Master_Bind: 
  Last_IO_Error_Timestamp: 
 Last_SQL_Error_Timestamp: 
           Master_SSL_Crl: 
       Master_SSL_Crlpath: 
       Retrieved_Gtid_Set: 
        Executed_Gtid_Set: 
            Auto_Position: 0

SHOW PROCESSLIST on the slave:

|    1 | system user |           | NULL | Connect | 2143813 | Waiting for master to send event | NULL             |
|    2 | system user |           | NULL | Connect | 1448477 | Reading event from the relay log | NULL             |
| 1628 | root        | localhost | NULL | Query   |       0 | init                             | SHOW PROCESSLIST |

On the master side:

|  7947837 | repl       | host-vm:59420     | NULL          | Binlog Dump | 2143817 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL             |

SHOW SLAVE HOSTS on the master:

+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
|      3410 |      | 3410 |      6524 |
| 347643210 |      | 3306 |      6524 |
+-----------+------+------+-----------+

Does anyone have an idea what's going on and how I can fix it?

Best Answer

The solution was to add the following to my.cnf:

innodb_flush_log_at_trx_commit = 2

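Our server had slow disk I/O, and the setting above reduced the disk I/O load on the slave.

The trade-off is durability: with this setting, up to about one second of committed transactions can be lost if the operating system crashes or the server loses power. It is therefore not recommended for databases that cannot tolerate losing any committed transaction (banking, for example), but it is a reasonable choice for many other workloads.

Phrased differently: =1 writes and flushes the InnoDB log to disk at the end of each transaction, which is required for full ACID compliance. =2 writes the log at each commit but flushes it to disk only about once per second, which makes it much less I/O-intensive.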
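
The SHOW SLAVE STATUS output in the question is consistent with an I/O-bound SQL thread: the I/O thread has fetched everything from the master, while the SQL thread is still applying a much older binary log. As a rough sketch, the backlog can be quantified from the status fields. The values below are copied from the question; the helper names binlog_index and summarize are made up for this illustration.

```python
# Values copied from the SHOW SLAVE STATUS output in the question.
status = {
    "Master_Log_File": "mysql-bin.003790",        # last file read by the I/O thread
    "Relay_Master_Log_File": "mysql-bin.003492",  # file the SQL thread is applying
    "Relay_Log_Space": 31350956440,               # bytes of relay logs on disk
    "Seconds_Behind_Master": 1448477,
}


def binlog_index(name):
    """Return the numeric suffix of a binary log file name (hypothetical helper)."""
    return int(name.rsplit(".", 1)[1])


def summarize(s):
    """Express the replication backlog in files, GiB and days (hypothetical helper)."""
    files_behind = (binlog_index(s["Master_Log_File"])
                    - binlog_index(s["Relay_Master_Log_File"]))
    return {
        "files_behind": files_behind,
        "relay_backlog_gib": round(s["Relay_Log_Space"] / 2**30, 1),
        "days_behind": round(s["Seconds_Behind_Master"] / 86400, 1),
    }


summary = summarize(status)
print(summary)
```

A large gap between Master_Log_File and Relay_Master_Log_File together with a large Relay_Log_Space means the events have already arrived on the slave and are waiting to be applied, so the bottleneck is on the applying side rather than the network.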

Related Solutions

MySQL slave replicates changes that are in neither binlog_do_db nor replicate_do_db

Replication filtering isn't bulletproof. With statement-based replication, the filter is evaluated against the default database at query runtime, not against the tables the statement actually touches. The events responsible for your errors are generated because the default database is the my-database schema, as expected, while the query being executed is fully qualified: INSERT INTO phpmyadmin.pma_column_info...

Peter Zaitsev explains the scenario well in this post:

Filtered MySQL Replication

MySQL 5.6 – GTID Replication Slave Stuck with System Lock

Since I see more than 2 system user entries in the processlist, I would assume you are using Multi-Threaded Replication (slave_parallel_workers > 1).

That looks like a bug.

On Oct 29, 2014, David Moss wrote:

Thank you for your feedback. This issue was covered in bug 17326020 and the following was added to the MySQL 5.6.21 and 5.7.5 changelogs:

When the I/O thread reconnected to a master using GTIDs and multithreaded slaves while in the middle of a transaction, it failed to abort the transaction, leaving a partial transaction in the relay log, and then retrieving the same transaction again. This occurred when performing a rotation of the relay log. Now when reconnecting, the server checks before rotating the log in such cases, and waits first for any ongoing transaction to complete.

Therefore nothing new will be added to cover this bug and I'm closing it as fixed.

On Dec 10, 2014, Laurynas Biveinis wrote:

Problem:

With MTS, GTIDs and auto positioning enabled, when a worker applies a partial transaction left on relaylog by an IO thread reconnection, it will wait for the XID log event to commit the transaction.

Unfortunately, the SQL thread coordinator will reach the master's ROTATE event on the next relaylog file and will wait for all workers to finish their tasks before applying the ROTATE.

Analysis:

As the whole transaction is retrieved again by the IO thread after the reconnection, the slave must rollback the partial transaction once noticing this ROTATE from the master.

This bug reports the same issue already fixed by BUG#17326020, and the reported issue is not reproducible anymore. So, this patch is just adding a new test case.

SUGGESTION

Run FLUSH BINARY LOGS; on the Master

See if the movement triggers a response from the SQL threads.

If it does not, go ahead and remove slave_parallel_workers from my.cnf and restart mysql.

Since you started MySQL on both the master and the slave and got error 1236, you are trying to establish replication from an impossible position. In the context of GTIDs and the error message you got, the binary logs needed to fully identify a set of queries within a GTID set no longer exist.

Look back at your SHOW SLAVE STATUS\G output:

Retrieved_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:4757140-5030085
 Executed_Gtid_Set: 7846a847-62c7-11e5-91a6-e06995de432e:1-4783274

From this, the last GTID executed is 7846a847-62c7-11e5-91a6-e06995de432e:4783274

This means that the binary log that has or had 7846a847-62c7-11e5-91a6-e06995de432e:4783275 no longer exists.
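
The arithmetic above can be sketched in a few lines. This is only an illustration, assuming a GTID set with a single server UUID and gap-free intervals, as in the output shown; the function names parse_gtid_set and next_needed_gtid are invented for this example and are not part of any MySQL tooling.

```python
def parse_gtid_set(text):
    """Parse a single-UUID GTID set such as 'uuid:1-10:15'.

    Returns (uuid, [(start, end), ...]). Hypothetical helper; it does not
    handle GTID sets that list several server UUIDs.
    """
    uuid, *intervals = text.strip().split(":")
    ranges = []
    for interval in intervals:
        start, _, end = interval.partition("-")
        # A bare number like '15' is an interval of length one.
        ranges.append((int(start), int(end or start)))
    return uuid, ranges


def next_needed_gtid(executed_gtid_set):
    """Return the first GTID after the highest executed transaction number."""
    uuid, ranges = parse_gtid_set(executed_gtid_set)
    last_executed = max(end for _, end in ranges)
    return f"{uuid}:{last_executed + 1}"


# Executed_Gtid_Set copied from the SHOW SLAVE STATUS output above.
executed = "7846a847-62c7-11e5-91a6-e06995de432e:1-4783274"
print(next_needed_gtid(executed))
```

The printed GTID is the one the slave has to find in the master's binary logs in order to continue.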

I can see this happening if you stopped replication on the slave, left it off long enough for the master to purge binary logs (via expire_logs_days) that the slave still needed, and then turned replication back on.

In your particular case, try doing a mysqlbinlog dump of the binary log mysqld-bin.000141. If nothing comes out of it, you will have to reload the slave and set up replication from scratch.
