Mysql – High disk IO with MySQL master-master replication

master-master-replicationMySQLmysql-8.0replicationsemi-sync-replication

We've been evaluating an active-passive master-master replication setup for our MySQL database, running on version 8.0. While diagnosing some general slowness issues, we were surprised to see very high disk IO ranging from 500MB/s – 1500MB/s on both servers.

I've found that stopping the slave on either server brings the disk IO back to expected levels, 15MB/s – 30 MB/s, and I'm concerned that we have a misconfiguration somewhere.

Here is what I've checked:

  • The server ID's are unique.
  • We are doing row-based replication.
  • We are using the semi-synchronous replication plugin.
  • The writes are happening in an LVM volume that exclusively houses the data directory.
  • vmstat shows no swapping.
  • From the network perspective, we see a few megabits per second going between the servers at most.

Here is the redacted configuration:

[mysqld]
...

# General
server_id                        = <unique_between_servers>                         
event-scheduler                  = ON                              
sync_binlog                      = 1                               
log_bin                          = /var/lib/mysql/mysql-bin        
binlog_format                    = ROW                             
innodb_file_per_table                                              
innodb_flush_log_at_trx_commit   =  1                              
skip-external-locking                                                                                                


# Replication
relay_log                        = /var/lib/mysql/mysql-relay-bin
relay_log_recovery               = ON
log_slave_updates                = 1
sync_master_info                 = 1
sync_relay_log                   = 1
sync_relay_log_info              = 1
rpl_semi_sync_master_enabled     = 1
rpl_semi_sync_master_timeout     = 10000    # 10 seconds
rpl_semi_sync_slave_enabled      = 1


# Fine Tuning
key_buffer_size          = 16M
max_allowed_packet       = 16M
thread_stack             = 192K
thread_cache_size        = 8
myisam-recover-options   = BACKUP           
max_connections          = 800
max_connect_errors       = 1000
innodb_buffer_pool_size  = 192G    # Server has 256GB RAM
innodb_log_buffer_size   = 256M
innodb_log_file_size     = 2G
innodb_flush_method      = O_DIRECT
innodb_lock_wait_timeout = 60
...

My problem sounds suspiciously similar to this person's dead thread from 2013 (Who were you, DenverCoder9? WHAT DID YOU SEE!?)

EDIT:

In an effort to zero in on the problem, we simplified the setup for testing. We used one master and one slave, both running 8.0.11, same configuration as above, and we saw the following write patterns according to iostat (master is blue, slave is red):

Master-Slave IO

(The "Time" on the X-axis is the minutes and seconds of the current time.)

Those per-minute 1+ GB spikes on the master were tracked down to a script updating two small tables (1000-2000 rows each), around 10-15 MB in size on disk. These tables had keys, but the keys were not primary or unique. The script would DELETE all rows in the tables and then repopulate them by performing one INSERT statement per row, all within a transaction. Despite the glaring problems, I still don't understand how operations on tables this small can result in this much disk IO.

Without fixing the problems, we then returned to the master-master active-passive setup and saw the following (active master is blue, passive master is red):

Active-Passive Master-Master IO

So some additional questions I have are this. Why is the slave consistently writing much more than the master in the first graph? Also, why in master-master replication is there the ping-pong effect seen above where writes to the master result in heavy IO on the slave, which results in heavy IO again on the master?

Best Answer

  • ALTERs being replicated? They could be low replication bandwidth but high I/O.
  • How much RAM? innodb_buffer_pool_size is quite large.
  • Do your tables have lots of secondary indexes?
  • Was there previously a lot of activity that led to updating secondary, non-UNIQUE, indexes? The Change Buffer is in the buffer_pool and it is flushed asynchronously. See innodb_change_buffer_max_size, but don't necessarily change it.
  • Lots of TRIGGERs?
  • How big are the tables?
  • In spite of ROW, might a full-table UPDATE have gone through as STATEMENT?
  • LVM's COW may double (or triple?) the I/O activity. (But this would only partially explain what you are seeing.)
  • Are some big SELECTs happening? (Turn on the slowlog, with a small value for long_query_time.)

No, I don't see any configuration changes to make.

More

Some issues that may be relevant (especially after updates to the Question)

  • For a single UPDATE of 1K rows, Row Based Replication is rather verbose. Each row is a separate record in the replication stream, and it is rather large (compared to the update, itself). These records will be buffered and written to the Master's disk (binlog) and to the Slave's disk (relay log). I would assume that they are buffered, etc, so that the amount of writing is the same. But maybe I am wrong.
  • The "sql thread" on the Slave reads all of these 1-row updates, presumably one at a time. Again, one would hope that there is buffering such that this does not add to the I/O.
  • Both machines need to update any secondary keys via the Change buffer. However, this is a delayed action. In fact, it might be delayed so long that it won't show up on the graphs you have presented.
  • In a Master-Master setup, I do not know when the "loop" is stopped. That is, server_id 123 replicates to its Slave, server_id 234, which then replicates to all of its Slave, including 123. Upon discovering 123=123, the replication packets are dropped. But does that happen before getting back to 123 and being written to the relaylogs there? I don't know. You can figure it out by looking at the growth of the relaylogs on the active Master.
  • What other logs does the Slave have turned on? Presumably the same as the Master. General log? Slow log? Others?

Scenario details

  • SET autocommit=0;
  • DELETE all of 1-2K rows in a particular table. If all at once, and Row-based-replication, then a lot is sent through the replication stream.
  • INSERT one row back in
  • INSERT one row back in
  • ... 1-2K rows inserted
  • COMMIT; It is not until this point that anything goes across to the Slave. I suspect this is when the binlog is being written all at once.

Well, I guess this still does not answer your question, but maybe it provides some more insight.

How to replace a table

This is a more efficient way to replace all of a table. It avoids the table ever being out of service:

CREATE TABLE new LIKE real;
load the new data into the table (LOAD DATA or multi-row INSERT(s) is best)
RENAME TABLE real TO old, new TO real;
DROP TABLE old;