MySQL stalls and strange threads/connections behavior is observed

Tags: connection-pooling, connections, max-connections, mysql

We have a MySQL instance which most of the time works just fine. From time to time, however, we observe strange behavior. What happens is that suddenly MySQL connections reach the limit (currently set to 1000) and MySQL pretty much stalls. The following is observed when we check the graphs for this server:

[Grafana graphs of MySQL connections, threads cached, and threads created around the incident]

The image is from a Grafana instance running a MySQL exporter. What troubles me is why threads from the cache are not reused. The other question is why the number of threads in the cache suddenly drops to zero while, at the same time, no new threads are created (according to the second graph, the blue line stays at 0).

Ultimately, my goal is to understand why 1000 connections are suddenly in use, and whether this is a problem with MySQL or with the app. This is what the MySQL docs say regarding threads:

How many threads the server should cache for reuse. When a client
disconnects, the client's threads are put in the cache if there are
fewer than thread_cache_size threads there. Requests for threads are
satisfied by reusing threads taken from the cache if possible, and
only when the cache is empty is a new thread created.
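For completeness, these are the kinds of counters I am looking at; a quick check run directly on the server (the exact queries the exporter uses may differ):

    -- Current thread cache size
    SHOW GLOBAL VARIABLES LIKE 'thread_cache_size';

    -- Thread and connection counters exported to Grafana:
    -- Threads_created growing quickly relative to Connections means
    -- threads are not being reused from the cache.
    SHOW GLOBAL STATUS WHERE Variable_name IN
        ('Threads_cached', 'Threads_connected', 'Threads_created',
         'Threads_running', 'Connections');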

Any assistance or guidance is highly appreciated!

Best Answer

Thundering herd. Or, so many people in a grocery store that no one can move their cart.

When Threads_running is 349, each of them is waiting and waiting and waiting for a tiny slice of CPU and a tiny amount of I/O and ... The effect is that they all take 'forever' to finish. Meanwhile, others trying to get in are let in because you have such a high max_connections, etc.

But, that does not explain what pushed the herd over the cliff. It could be something as trivial as a big SELECT that happens to block what the herd normally does.
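If the event happens again, a snapshot of the running threads at that moment would show what the herd is stuck on. A minimal sketch (the 120-character truncation is only to keep the output readable):

    -- What is everybody doing right now?
    SHOW FULL PROCESSLIST;

    -- Same information, but filterable and sortable:
    SELECT id, user, time, state, LEFT(info, 120) AS query
    FROM information_schema.processlist
    WHERE command <> 'Sleep'
    ORDER BY time DESC;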

Do you have the slowlog turned on, with a low value of long_query_time? If so, you may already have the answer in the slowlog. Since you did not panic and reboot, that slow SELECT (if it really exists) will still be in the slowlog. Based on the graphs, I would guess it runs for something like 80-110 seconds.
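If it is not on yet, the slowlog can be enabled at runtime; the threshold below is only an example, pick whatever "slow" means for your workload:

    -- Is the slow query log on, and where does it go?
    SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
    SHOW GLOBAL VARIABLES LIKE 'long_query_time';

    -- Turn it on with a low threshold (example value; this lasts only until
    -- the next restart unless it is also set in my.cnf).
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;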

As for the dip in Threads_cached, that gets a bit complex. First, let me explain the non-dip parts. Before the 'event', about 200 connections were made; each lasted long enough for there to be a 'steady' 200 connected.

After the event, there were so many waiting to get in that it pushed the number up to about 350.

Having thread_cache_size = 500 wastes memory; suggest you lower it to 250.
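That change can be made at runtime (and should also go into my.cnf to survive a restart); the statement below simply applies the value suggested above:

    -- Shrink the thread cache to the suggested size.
    SET GLOBAL thread_cache_size = 250;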

As for the dip, I will do some guessing. So many connections were trying so hard to get things done that new connections were denied access. (This may be deliberate on the part of MySQL, or it may be a side effect of everyone clamoring for various mutexes.)

It is also unwise to have max_connections = 1000; it just invites scenarios like this one. Instead, have the clients 'take the heat' when something hiccups -- by being denied access to MySQL. Furthermore, this makes it easier to say "Oops, we are having a problem; please wait, and please do not press Send again, nor Refresh!" Or simply give a "500".
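Before picking a new limit, check the actual high-water mark. The number in the sketch below is a placeholder, not a recommendation; size it to what the application tier genuinely needs plus some headroom:

    -- Highest number of simultaneous connections seen since the last restart.
    SHOW GLOBAL STATUS LIKE 'Max_used_connections';

    -- Placeholder value; choose your own based on the result above.
    SET GLOBAL max_connections = 300;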

Bottom line: I think the slowlog can point to the cause of the event.