MySQL – Why does our MySQL server become non-responsive, with the processlist showing many processes waiting in STATE=init, INFO=commit?

innodb, MySQL, mysql-5.6, query-performance

I work for a company that's been around for a while and has a large MySQL (5.6.48) monolith running in RDS. Recently, the database has started going unresponsive for 10-30 minutes at a time during peak traffic, often 3 or 4 times over the course of the peak hours.

During these unresponsive periods, it is almost impossible to open a connection to the database (timeouts are the most common result). When a connection does succeed, queries perform as expected for a short while. The processlist shows dozens to hundreds of entries with state "init" and info "commit". Row operations drop to nearly zero and stay there for minutes on end until the database suddenly begins to recover, the processlist clears, and it becomes responsive to traffic once again.
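For concreteness, this is roughly the snapshot we capture during a storm: a minimal sketch against INFORMATION_SCHEMA.PROCESSLIST (available in MySQL 5.6), with the state/info filter simply matching the symptom described above.

    -- Threads stuck committing, oldest first.
    SELECT id, user, host, db, time, state, info
    FROM   information_schema.processlist
    WHERE  state = 'init'
      AND  info  = 'commit'
    ORDER  BY time DESC;

    -- Count of how many threads are stuck in that state right now.
    SELECT COUNT(*) AS stuck_commits
    FROM   information_schema.processlist
    WHERE  state = 'init'
      AND  info  = 'commit';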

What we have attempted so far:

  • We have tried to eliminate slow queries during peak hours, shutting down large swaths of functionality and background workers during those windows.
  • We have increased the InnoDB redo log size.
  • We have looked for lock contention and deadlocks (see the query sketch after this list).
  • We have doubled the compute power, memory, and throughput available to the RDS instance. (DB CPU usage during peak hours hovers around 50% and there are no alarming spikes in memory or disk or network usage.)
  • We have tried using proxy servers to hold open long-lived connections to the database.
  • We've looked for any recent changes or new queries introduced in the app.
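For the lock-contention check in particular, queries like the following were the workhorse; a sketch against the InnoDB tables that MySQL 5.6 exposes in INFORMATION_SCHEMA (SHOW ENGINE INNODB STATUS reports much of the same information in text form).

    -- Open transactions, oldest first; long-lived ones holding locks surface here.
    SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id,
           trx_rows_locked, trx_rows_modified, trx_query
    FROM   information_schema.innodb_trx
    ORDER  BY trx_started;

    -- Who is blocking whom (only populated while a lock wait is in progress).
    SELECT w.requesting_trx_id,
           w.blocking_trx_id,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_query           AS blocking_query
    FROM   information_schema.innodb_lock_waits w
    JOIN   information_schema.innodb_trx b
           ON b.trx_id = w.blocking_trx_id;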

All of that only helped a little bit. We often still see at least one "storm" where we get stuck in the bad state, and re-enabling any batch jobs tends to push us over the edge.

Has anyone seen the pattern of processes getting stuck in STATE=init, INFO=commit for minutes on end? Does anyone have suggestions on how to proceed with debugging or analysis? What other resource contention could we be running into?

Best Answer

  • You are seeing the victims. Probably only a few of them are the villains.
  • Look at "Time". A few 'system' processes may have very large times; look at the next couple -- they may be the ones that started the problem
  • How big is max_connections? If it is more than a few dozen, it is inviting a log jam.
  • When a lot of threds are running (check Threads_running), they tend to stumble over each other, waiting for resources (CPU, I/O, buffer_pool space, table cache space, etc). Meanwhile, the allocation mechanism is "playing fair". That is, it is given each player a little of what it needs, so as to give a fair amount to the others. Net effect: Latency goes through the roof and throughput flattens or declines.
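A rough sketch of that triage, querying INFORMATION_SCHEMA.PROCESSLIST directly; the user filter is only a guess at what RDS's background accounts are called, so adjust it for your instance.

    -- Oldest active application threads first; the handful at the top are more
    -- likely to be the villains than the hundreds of waiting commits below them.
    SELECT id, user, time, state, LEFT(info, 100) AS query_head
    FROM   information_schema.processlist
    WHERE  command <> 'Sleep'
      AND  user NOT IN ('system user', 'rdsadmin')  -- assumed RDS/system accounts
    ORDER  BY time DESC
    LIMIT  20;

    -- How close the server is to a thread pile-up.
    SHOW GLOBAL STATUS    LIKE 'Threads_running';
    SHOW GLOBAL VARIABLES LIKE 'max_connections';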

Cures:

  • Check for swapping -- Swapping is terrible for performance, and it may be a factor in what you are seeing.
  • Lower (yes, lower) max_connections.
  • Lower (yes, lower) the number of web server clients that can run simultaneously.
  • Use the slowlog (with a low value of long_query_time) to identify both the long-running queries and the fast but frequently run queries; improve both (a sketch follows this list). It is usually easier to give the end user a civilized message like "System is busy, DO NOT HIT RELOAD" than to let things clog up and "never" finish.
  • Use replicas to spread the read-only queries across multiple servers.
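A minimal sketch of that slowlog setup, with an arbitrary half-second threshold; on RDS these settings normally go in the DB parameter group rather than through SET GLOBAL.

    -- Capture both the slow queries and the fast-but-frequent ones.
    SET GLOBAL slow_query_log  = ON;
    SET GLOBAL long_query_time = 0.5;     -- seconds; deliberately low
    SET GLOBAL log_output      = 'FILE';  -- or 'TABLE' to read mysql.slow_log directly

Let it run over a representative peak window, then aggregate the log (for example with pt-query-digest) to see which query fingerprints dominate total time.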