MySQL V8 replication lagging waiting for handler commit

mysql, performance, replication

Let me start with some background.

We decided to upgrade our master DB to a more performant machine. We got a server with 1.5 TB of RAM, quad Intel(R) Xeon(R) Gold 6254 processors, and all-NVMe storage. Our current environment is on MySQL 5.7.30, which is what I installed on this new server. I activated it as a read server first, just to see how it behaves, basing its config on the existing master's with added slave flags. After a few days we got complaints of random slowdowns in our main application. What we discovered is that the slave was lagging and that DROP TABLE statements seemed to be locking the environment. I took it off the read pool to see if I could figure out the issue.

After doing some research we thought the issue was this bug. We had always planned to upgrade to MySQL 8, so I figured I might as well do it now and get it over with. We upgraded to 8.0.22 and re-activated the new server on the network. Again, after a few days it started to slow down on DROP TABLE statements and inserts. Since this is version 8, we're seeing a lot of threads in the 'waiting for handler commit' state. After researching that state we found some docs suggesting range_optimizer_max_mem_size might be the issue. We set it to 0, restarted MySQL, and waited. A day later the slowdowns were back.
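
For reference, the change amounts to this (we put it in my.cnf and restarted, but the variable is also dynamic, so something like the following has the same effect at runtime):

-- 0 removes the per-session cap on range optimizer memory.
SET GLOBAL range_optimizer_max_mem_size = 0;
-- Confirm the running value.
SHOW GLOBAL VARIABLES LIKE 'range_optimizer_max_mem_size';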

We then started looking at the OS (CentOS 7.8), thinking maybe the open files limit wasn't high enough. We checked, and we had set it to over 1M open files. We did notice, though, that we had set the limit on the mysql user while systemd has its own setting. We set it at the systemd level (/lib/systemd/system/mysqld.service and /etc/systemd/system/mysqld.service.d/limits.conf) and restarted MySQL. A day passed and the slowdown re-appeared. Now, though, the server can't even keep up with replication, let alone additional traffic.
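
A quick way to confirm the new limit actually reached the running mysqld is to ask the server itself (this reflects what it negotiated at startup and can be lower than the OS setting):

-- What mysqld believes its file-descriptor budget is.
SELECT @@GLOBAL.open_files_limit;
-- How many file handles are actually in use right now.
SHOW GLOBAL STATUS LIKE 'Open%files';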

I thought it was my multi-threaded replication (as we have 4 primary databases), so I dropped it to single-threaded. All that did was make SHOW SLAVE STATUS always report Slave_SQL_Running_State: waiting for handler commit.
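
For the record, dropping back to a single applier thread looked roughly like this (using the slave_* variable names that apply to 8.0.22):

-- Stop the applier, switch to a single worker thread, restart it.
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_workers = 0;
START SLAVE SQL_THREAD;
SHOW SLAVE STATUS\G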

At this stage I'm just stumped as to what's happening. Replication on the new server keeps falling further and further behind, and SHOW FULL PROCESSLIST always has queries in either 'waiting for handler commit' or 'checking permissions'. What seems slowest is DROP TABLE, though. I can see a DROP TABLE hanging in the process list for 3-5 seconds:

State: waiting for handler commit
Info: DROP TABLE IF EXISTS crm.invalidPhoneNumbersCompiled_TMP,crm.invalidPhoneNumbersCompiled_OLD /* generated by server */
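
A quick way to watch just the stuck threads, rather than scanning the full processlist output, is to filter it (a generic query, not something specific to my setup):

-- List only threads in the two problem states, longest-running first.
SELECT id, time, state, LEFT(info, 80) AS query_text
FROM information_schema.PROCESSLIST
WHERE state IN ('waiting for handler commit', 'checking permissions')
ORDER BY time DESC;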

We also found this on Stack Overflow and went ahead and set all the variables it suggested on our server, but got the same result.

In PMM I can't see anything out of the ordinary on either the OS overview or the MySQL overview. The server is practically idle and replication is just lagging. I/O is idle 99.9% of the time and the load isn't even a blip.

my.cnf

INNO Engine Status

Sample log

UPDATE 1

Here's the status
Here are the variables

Replication finally caught up as the load of the day subsided. Table size doesn't seem to affect how long a drop takes. A lot of the time it's just a temp table used as an intermediate for a batch update.
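
For context, those batch jobs follow a build-and-swap pattern along these lines (column definitions here are made up; only the table names come from the DROP statement shown earlier):

-- Illustrative schema; the real table definition isn't shown in this post.
CREATE TABLE IF NOT EXISTS crm.invalidPhoneNumbersCompiled (phone VARCHAR(32) PRIMARY KEY);
-- Build the new data set off to the side.
CREATE TABLE crm.invalidPhoneNumbersCompiled_TMP LIKE crm.invalidPhoneNumbersCompiled;
-- (batch INSERTs into the _TMP table happen here)
-- Swap the new copy in atomically, keeping the previous one as _OLD.
RENAME TABLE crm.invalidPhoneNumbersCompiled TO crm.invalidPhoneNumbersCompiled_OLD,
             crm.invalidPhoneNumbersCompiled_TMP TO crm.invalidPhoneNumbersCompiled;
-- Cleanup: this is the statement that stalls on the replica.
DROP TABLE IF EXISTS crm.invalidPhoneNumbersCompiled_TMP, crm.invalidPhoneNumbersCompiled_OLD;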

UPDATE 2

File limits were one of the first things I looked at. Here's the ulimit data:

 ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 6184343
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 100000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Also, since it's on CentOS 7, here are the systemd limits (I tried 100K all the way up to 10M):

[Service]
LimitNOFILE=10000000

Here is /etc/security/limits.conf

mysql soft nofile 200000
mysql hard nofile 10000000

Here is lsof -u mysql | wc -l => 6200, and slave lag is at 3635 seconds behind the master.

Another interesting thing: last night around 17:00 it decided to unblock, and what we saw was a network traffic increase on it.
Traffic uptake when replication unblocked

UPDATE 3

Replica Status

HTOP

UPDATE 4

Process Limits

UPDATE 5

Came in this AM and the server was 12,000 seconds behind the master, with bandwidth usage, as usual, at almost nothing. Around 11 AM it unblocked (no idea why) and replication started catching up.


UPDATE 6

Working with some outside consultants on the issue, I think we've at least identified the problem. It's definitely an InnoDB issue, and I'm pretty sure it involves tablespace lookups. Take a look at this to see the problem in its full glory. Notice that dropping and creating MyISAM tables is instant, while doing it on InnoDB tables takes 3-10 seconds. Then the best part: disable the binary log, do the same InnoDB create/drop, and it's instant. The one difference between all my other read servers and this one is that they are all purely read and don't write binary logs (just relay logs). I've disabled the binary log on this one and restarted it; it's flying for now. I'll monitor for 2-3 days, as it's never gone that long without crapping out. At least I know where the issue is, even if I don't know why yet.
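
If anyone wants to reproduce the comparison, the test boiled down to something like this (run it in a scratch schema; I use a hypothetical test schema here, and SET sql_log_bin only affects the current, privileged session):

-- InnoDB create/drop with binary logging on (slow in our case).
CREATE TABLE test.drop_test_innodb (id INT PRIMARY KEY) ENGINE=InnoDB;
DROP TABLE test.drop_test_innodb;
-- MyISAM create/drop for comparison (instant in our case).
CREATE TABLE test.drop_test_myisam (id INT PRIMARY KEY) ENGINE=MyISAM;
DROP TABLE test.drop_test_myisam;
-- Same InnoDB test with binary logging disabled for this session only (also instant).
SET SESSION sql_log_bin = 0;
CREATE TABLE test.drop_test_innodb (id INT PRIMARY KEY) ENGINE=InnoDB;
DROP TABLE test.drop_test_innodb;
SET SESSION sql_log_bin = 1;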

UPDATE 7

So if anyone else has a similar issue, I'm pretty sure I figured it out. If you search for DROP TABLE InnoDB bugs you'll see issues going back to 2008. After reading code and finding a super helpful article from Percona, I think I figured out what is happening. InnoDB has an adaptive hash index which is supposed to help speed up queries, making InnoDB behave more like an in-memory DB. The issue is that on a DROP TABLE, InnoDB needs to ensure all transactions involving that table are complete before executing the drop. When it does this it actually locks the entire DB (many bug reports say this was resolved, but as of 8.0.22 it's exactly the issue I was seeing), and since it's locked you get replication latency. Interestingly enough, this issue exists on all of my servers, but for some reason it is way more evident on my big-RAM server. Disabling innodb_adaptive_hash_index immediately caused the server to start catching up. I have now been running replication + binlogging on the slave for over 2 days with 0 lag. I am waiting until mid next week to declare victory over this bug.
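
For anyone wanting to try the same change: the adaptive hash index can be turned off at runtime (add innodb_adaptive_hash_index = OFF to my.cnf as well so it survives restarts):

-- Disable the adaptive hash index without a restart.
SET GLOBAL innodb_adaptive_hash_index = OFF;
-- Verify.
SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index';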

On my other servers the issue is weirder and harder to pin down. You can do 20 drop/creates of a simple 1-column table and randomly one will take 3-4 seconds. That's why I never noticed it on any of my other servers. I think 64 buffer pool instances and an InnoDB buffer pool of over 200 GB make the issue visible. It might also be due to the fact that I have thousands of tables. Interestingly enough, when I dropped my buffer pool to 128 GB I still saw the issue, and that was due to binary logging. Replication + binlogging with the adaptive hash index on makes the issue super visible. I am left wondering whether innodb_adaptive_flushing also slows things down; for now it's on and everything seems to work.

If by mid-week I see no lag, I will mark this issue resolved 🙂

Best Answer

Rate Per Second = RPS

Suggestions to consider for your my.cnf [mysqld] section

innodb_io_capacity=30000  # from 6000 for higher IOPS with your NVME devices

key_buffer_size=4G  # from ~ 64G since only 2% has been used
key_cache_age_threshold=7200   # from 300 seconds before age out
key_cache_block_size=16384  # from 1024 to minimize number of blocks to manage
key_cache_division_limit=50  # from 100 for Hot/Warm cache

These changes should reduce key_reads to well below the current 29 reads per second. For additional suggestions, view my profile (Network profile for contact info and free downloadable utility scripts to assist with performance tuning). Tables are being scanned 7 times per second; indexes may provide relief. EXPLAIN SELECT (query) will help identify missing indexes.
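
Most of these are dynamic variables, so one possible way to trial them before committing them to my.cnf is to apply them at runtime with SET PERSIST:

-- Apply the suggested values at runtime and persist them across restarts.
SET PERSIST innodb_io_capacity = 30000;
SET PERSIST key_buffer_size = 4 * 1024 * 1024 * 1024;  -- 4G
SET PERSIST key_cache_age_threshold = 7200;
SET PERSIST key_cache_block_size = 16384;  -- note: resizing re-initializes the key cache
SET PERSIST key_cache_division_limit = 50;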