MySQL Replication binary relay logs transfer too slowly

innodbmaster-slave-replicationMySQLreplication

Master server

MySQL 5.6.28

Ubuntu 14.04

8 core CPU

Slave server

Slave: MySQL 5.7.22

Ubuntu 18.04

16 core CPU

Both of them have 64GB of RAM and plenty of disk space.

What I did up to this point

I made a dump of the master database, copied it over to the slave server and set up a Slave database there. The replication works, but it's too slow.

The slave started approx. 2.5 days after the initial dump and isn't catching up. Looking at the relay log files, it seems that they're filling up too slowly (approx 1MB every few seconds). This is on a 100GB+ database.

I've tried

Checked the disk io with iotop – They're good on both the Master and the Slave. They're not SSDs but they don't seem to be the bottleneck.

Checked network speeds with bmon – they're barely scratching the surface. Both machines are on a Gigabit network. I've tried running scp (with the replication in progress) and I'm getting up to 100MB/s transfer. The relay logs seem to be transfering at less than 1MB/s.

Checked the CPU – both servers have plenty to spare.

I made sure the innodb settings are the same on both servers. All tables are innodb.

Looking at SHOW SLAVE STATUS\G I see that most of the time is spent waiting for new relay logs to transfer. There is no delay on the SQL side, it's always caught up.

Worth mentioning that the binlog_format is ROW.

tl;dr:

What could be limiting the speed at which the binlogs are relayed from Master to Slave, if there is still plenty bandwidth, CPU and disk I/O available?

EDIT 1:

Results of SHOW SLAVE STATUS:


Slave_IO_State: Waiting for master to send event
Master_Host: master.server
Master_User: sqlslave
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.285479
Read_Master_Log_Pos: 87535361
Relay_Log_File: slave-relay-bin.001588
Relay_Log_Pos: 87535479
Relay_Master_Log_File: mysql-bin.285479
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 87535282
Relay_Log_Space: 87535812
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 41260
Master_UUID: 2b9b59f9-4290-11e5-bf92-0cc47a02cb8e
Master_Info_File: /var/lib/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:

EDIT 2:

I'm leaning towards a combination of disk or network so I've been running iostat and iotop. The disk that gets the binlogs written to on the master is always near 100% utilization, but the "wait" is only about 10ms. Still, when I run scp it runs quickly, so I'm not sure why transferring the binlogs is about 200x slower.

There are also no errors on the Slave side, I'm monitoring the logs. Binlogs are disabled on the slave, and compressed binlogs are enabled.

Queries on the Slave are running fast, it's just that it keeps waiting on new relay logs to be transferred.

Best Answer

Ok, in my case the constraint was the Disk I/O. The binlogs are written to the hard disk which slows the whole thing down a bit.

The reason I couldn't see this at first was that MySQL was not using the Disk I/O to it's full potential, possibly due to settings or "niceness" of the MySQL process.

In the end it took a few days for the Slave to get caught up, and now it's fine. But potential, better solutions in the future:

One or many of the above might help fix similar issues.