MySQL – Why does our MySQL server become non-responsive, with the processlist showing many processes waiting in STATE=init, INFO=commit?

innodb, MySQL, mysql-5.6, query-performance

I work for a company that's been around for a while and has a large MySQL (5.6.48) monolith running in RDS. Recently, the database has started going unresponsive for 10-30 minutes at a time during peak traffic, often 3 or 4 times over the course of the peak hours.

During these unresponsive periods, it is almost impossible to open a connection to the database (timeouts are the most common result). When a connection does succeed, queries perform as expected for a short while. The processlist shows dozens to hundreds of entries with state "init" and info "commit". Row operations drop to nearly zero and stay there for minutes on end until the database suddenly begins to recover, the processlist clears, and it becomes responsive to traffic once again.
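For concreteness, this is roughly the snapshot we capture during a storm: a minimal sketch against INFORMATION_SCHEMA.PROCESSLIST (available in MySQL 5.6), with the state/info filter simply matching the symptom described above.

    -- Threads stuck committing, oldest first.
    SELECT id, user, host, db, time, state, info
    FROM   information_schema.processlist
    WHERE  state = 'init'
      AND  info  = 'commit'
    ORDER  BY time DESC;

    -- Count of how many threads are stuck in that state right now.
    SELECT COUNT(*) AS stuck_commits
    FROM   information_schema.processlist
    WHERE  state = 'init'
      AND  info  = 'commit';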

What we have attempted so far:

  • We have tried to eliminate slow queries during peak hours, shutting down large swaths of functionality and background workers during those windows.
  • We have increased the InnoDB redo log size.
  • We have looked for lock contention and deadlocks (see the query sketch after this list).
  • We have doubled the compute power, memory, and throughput available to the RDS instance. (DB CPU usage during peak hours hovers around 50% and there are no alarming spikes in memory or disk or network usage.)
  • We have tried using proxy servers to hold open long-lived connections to the database.
  • We've looked for any recent changes or new queries introduced in the app.
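For the lock-contention check in particular, queries like the following were the workhorse; a sketch against the InnoDB tables that MySQL 5.6 exposes in INFORMATION_SCHEMA (SHOW ENGINE INNODB STATUS reports much of the same information in text form).

    -- Open transactions, oldest first; long-lived ones holding locks surface here.
    SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id,
           trx_rows_locked, trx_rows_modified, trx_query
    FROM   information_schema.innodb_trx
    ORDER  BY trx_started;

    -- Who is blocking whom (only populated while a lock wait is in progress).
    SELECT w.requesting_trx_id,
           w.blocking_trx_id,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_query           AS blocking_query
    FROM   information_schema.innodb_lock_waits w
    JOIN   information_schema.innodb_trx b
           ON b.trx_id = w.blocking_trx_id;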

All of that only helped a little bit. We often still see at least one "storm" where we get stuck in the bad state, and re-enabling any batch jobs tends to push us over the edge.

Has anyone seen the pattern of processes getting stuck in STATE=init, INFO=commit for minutes on end? Does anyone have suggestions on how to proceed with debugging or analysis? What other resource contention could we be running into?

Best Answer

  • You are seeing the victims. Probably only a few of them are the villains.
  • Look at "Time". A few 'system' processes may have very large times; look at the next couple -- they may be the ones that started the problem
  • How big is max_connections? If it is more than a few dozen, it is inviting a log jam.
  • When a lot of threds are running (check Threads_running), they tend to stumble over each other, waiting for resources (CPU, I/O, buffer_pool space, table cache space, etc). Meanwhile, the allocation mechanism is "playing fair". That is, it is given each player a little of what it needs, so as to give a fair amount to the others. Net effect: Latency goes through the roof and throughput flattens or declines.
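A rough sketch of that triage, querying INFORMATION_SCHEMA.PROCESSLIST directly; the user filter is only a guess at what RDS's background accounts are called, so adjust it for your instance.

    -- Oldest active application threads first; the handful at the top are more
    -- likely to be the villains than the hundreds of waiting commits below them.
    SELECT id, user, time, state, LEFT(info, 100) AS query_head
    FROM   information_schema.processlist
    WHERE  command <> 'Sleep'
      AND  user NOT IN ('system user', 'rdsadmin')  -- assumed RDS/system accounts
    ORDER  BY time DESC
    LIMIT  20;

    -- How close the server is to a thread pile-up.
    SHOW GLOBAL STATUS    LIKE 'Threads_running';
    SHOW GLOBAL VARIABLES LIKE 'max_connections';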

Cures:

  • Check for swapping -- Swapping is terrible for performance, and it may be a factor in what you are seeing.
  • Lower (yes, lower) max_connections.
  • Lower (yes, lower) the number of web server clients that can run simultaneously.
  • Use the slowlog (with a low value of long_query_time) to identify both the long-running queries and the fast but frequently run queries; improve both (a sketch follows this list). It is usually easier to give the end user a civilized message like "System is busy, DO NOT HIT RELOAD" than to let things clog up and "never" finish.
  • Use replicas to spread the read-only queries across multiple servers.
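A minimal sketch of that slowlog setup, with an arbitrary half-second threshold; on RDS these settings normally go in the DB parameter group rather than through SET GLOBAL.

    -- Capture both the slow queries and the fast-but-frequent ones.
    SET GLOBAL slow_query_log  = ON;
    SET GLOBAL long_query_time = 0.5;     -- seconds; deliberately low
    SET GLOBAL log_output      = 'FILE';  -- or 'TABLE' to read mysql.slow_log directly

Let it run over a representative peak window, then aggregate the log (for example with pt-query-digest) to see which query fingerprints dominate total time.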