MySQL replication slave VERY slow

mysql, mysql-5.6, replication

I am working on a replication problem that completely baffles me! This client has two MySQL replication clusters on HUGE bare-metal hardware. See Environment below.

The IO_thread of the slave is WAY behind, several hours or more. Yes, the IO_thread, not the SQL_thread. Why is it so hard to download these not-all-that-large binlog records and write them to disk? I tried to find a resource bottleneck, but given the massive hardware, none can be found.
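
For the record, this is how I tell which thread is behind (a sketch of the check, with field names as in MySQL 5.6, not output from this system):

    -- On the slave:
    SHOW SLAVE STATUS\G
    --   Master_Log_File / Read_Master_Log_Pos: how far the IO_thread has read
    --   Relay_Master_Log_File / Exec_Master_Log_Pos: how far the SQL_thread has applied

    -- On the master:
    SHOW MASTER STATUS;
    -- A large, growing gap between the master's position and
    -- Read_Master_Log_Pos means the IO_thread itself is behind.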

The only strange observation is that the slave does 8x the IO OPS of the master. But even this does not really overload the SSD. Packet traces show the slave often setting the TCP window to zero. Why, when there are plenty of resources?
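
The write load can also be attributed to individual files from inside MySQL (assuming performance_schema is enabled, as it is by default in 5.6):

    -- Which files receive the most writes? Relay logs vs. InnoDB files
    -- point to very different causes.
    SELECT file_name, count_write, sum_number_of_bytes_write
    FROM performance_schema.file_summary_by_instance
    ORDER BY count_write DESC
    LIMIT 10;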

Does anyone have ideas about what could be causing this strange behaviour? Why do I have more IO on the slave? What can cause the IO_thread to slow down?

Environment:
Both machines:
Bare-metal DELL, MySQL 5.6.30, 12 CPUs, 128 GB RAM,
datadir on SSD, Net I/F: Emulex 10 Gb, ROW-based binlog format

Symptoms:

MASTER:
CPU: 67% on one processor, the others lightly used. MEM: 70% used, 30% free.
IO OPS: ~2500 tps, 30% utilization on SSD. The thread serving the slave is in state "Send binlog to slave".

SLAVE:
CPU: 40% on one processor, the others lightly used. MEM: 70% used, 30% free.
IO OPS: ~16000 tps, 70% utilization on SSD. Error counters on the net I/F are 0 (zero).
The TCP window is often set to 0 on the IO_thread connection.
The slave IO_thread is VERY slow; it lags by more than an hour!

Another slave on the SAME master has no trouble at all, even though that slave has a much lower hardware spec!

The slave has trouble downloading the master's binlog.
Why this insanely high IO rate?
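
One hypothesis I have not ruled out yet (an assumption on my part, not a confirmed cause): if the relay log and replication metadata are fsynced on every event, the IO_thread itself generates heavy write IOPS and stops draining the socket, which would also fit the zero TCP window. The relevant 5.6 settings:

    -- I believe the 5.6 defaults are 10000 (one fsync per 10000 events);
    -- a value of 1 forces an fsync per event inside the IO_thread.
    SHOW VARIABLES LIKE 'sync_relay_log';
    SHOW VARIABLES LIKE 'sync_master_info';
    SHOW VARIABLES LIKE 'sync_relay_log_info';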

Stopping the slave also stops the IO OPS (as expected: the OPS come from MySQL).

Copying large amounts of data from master to slave over the network (using ncat) shows the performance we expect.

Other observations:

When the roles are reversed, the problem stays the same.

Another replication cluster with the same hardware has no trouble; in that cluster, the slave's IO OPS are slightly lower than the master's. That cluster uses STATEMENT-based binlog format.

Best Answer

If the I/O thread is behind, then the network is slow.
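
If the network really is the limit, compressing the replication stream is one knob to try (a sketch; both sides must support slave_compressed_protocol, and the I/O thread must reconnect to pick it up):

    -- On the slave:
    STOP SLAVE IO_THREAD;
    SET GLOBAL slave_compressed_protocol = ON;
    START SLAVE IO_THREAD;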

If the SQL thread is behind, then it could be the serial nature of replication (unless you have a newer version with multi-threaded replication), or contention from SELECTs, or disk I/O, or hardware differences (generally, a slave should be at least as powerful as the master), etc.
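
By "a newer version" I mean a multi-threaded slave. 5.6 already has one, but it only parallelizes across schemas, so it helps only when the writes span several databases. A sketch:

    -- 0 (the default) means the classic single SQL thread:
    SHOW VARIABLES LIKE 'slave_parallel_workers';
    -- To try per-database parallelism in 5.6 (worker count is an example value):
    STOP SLAVE SQL_THREAD;
    SET GLOBAL slave_parallel_workers = 4;
    START SLAVE SQL_THREAD;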

If you are doing huge UPDATEs or DELETEs, then Row Based Replication puts lots of stuff in the replication stream (binlog). Could this be relevant?
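
If so, one mitigation to consider (my suggestion, untested on your system) is logging minimal row images; binlog_row_image is available from 5.6.2 on:

    -- FULL (the default) logs complete before/after row images;
    -- MINIMAL logs only the primary/unique key plus changed columns.
    -- Caveat: the tables involved should have a primary key.
    SET GLOBAL binlog_row_image = 'MINIMAL';  -- applies to new sessions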

"MASTER: CPU: 67% 1 processor lightly used" -- Even 67% of one Core is rather high. Perhaps missing some needed composite indexes?

Is the Slave doing lots of SELECTs that could be interfering with replication?
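
A quick way to tell from SQL alone:

    -- Cumulative SELECT count since startup (sample it twice to get a rate):
    SHOW GLOBAL STATUS LIKE 'Com_select';
    -- And what is running right now:
    SHOW FULL PROCESSLIST;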

What are the values of SHOW VARIABLES LIKE 'query_cache%'; on both machines? Don't set query_cache_size bigger than, say, 50M, regardless of the amount of RAM.
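
For reference, checking and capping it could look like this (50M is the rule of thumb above, not a measured optimum):

    SHOW VARIABLES LIKE 'query_cache%';
    -- query_cache_size is dynamic in 5.6:
    SET GLOBAL query_cache_size = 50 * 1024 * 1024;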

How many GB of binlogs are created per hour?
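
One way to measure that without leaving SQL (a sketch: sample twice, an hour apart, and difference the positions, adding the sizes of any files rotated in between):

    -- Each row is one binlog file with its size in bytes:
    SHOW BINARY LOGS;
    -- The file currently being written and the position within it:
    SHOW MASTER STATUS;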