MySQL – Large number of MySQL running threads killing response times

Tags: innodb, mysql, mysql-5.6

Yesterday, our MySQL server's response time increased about 20 times: SELECT queries that earlier took 0.6-0.7 ms started taking 20 ms, and UPDATEs even more, around 60 ms.

The MySQL running-threads count, which usually stays around 10-15, shot up to 300-400 and stayed there for a good five hours, before coming down when traffic cooled.

Server version: 5.6.25-log MySQL Community Server (GPL), on a 32-core machine with 128 GB of RAM (90 GB InnoDB buffer pool).

The storage engine is InnoDB.

Some output from SHOW ENGINE INNODB STATUS, captured by pt-stalk:

----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 159291194
OS WAIT ARRAY INFO: signal count 1074930257
Mutex spin waits 1220535470, rounds 2921695607, OS waits 26779994
RW-shared spins 291539762, rounds 1856674850, OS waits 31197621
RW-excl spins 300274870, rounds 5700270052, OS waits 95339247
Spin rounds per wait: 2.39 mutex, 6.37 RW-shared, 18.98 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 20746118705
Purge done for trx's n:o < 20746118666 undo n:o < 5 state: running but idle
History list length 1154
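
If this recurs, the history list length can be tracked over time without parsing SHOW ENGINE INNODB STATUS. A minimal sketch, assuming MySQL 5.6's information_schema.INNODB_METRICS table (the trx_rseg_history_len counter should be enabled there by default):

    -- A steadily growing COUNT means purge cannot keep up, often because
    -- a long-running transaction is pinning old undo records.
    SELECT NAME, COUNT, MAX_COUNT
    FROM information_schema.INNODB_METRICS
    WHERE NAME = 'trx_rseg_history_len';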

Also, there are a large number of OS waits, specifically at:

InnoDB  dict0dict.cc:2606  os_waits=60628
InnoDB  dict0dict.cc:2606  os_waits=131984
InnoDB  dict0dict.cc:2606  os_waits=81342
InnoDB  dict0dict.cc:2606  os_waits=230958

Also at:

InnoDB  trx0rseg.cc:196  os_waits=491
InnoDB  trx0rseg.cc:196  os_waits=458
InnoDB  trx0rseg.cc:196  os_waits=489
InnoDB  trx0rseg.cc:196  os_waits=483
InnoDB  trx0rseg.cc:196  os_waits=459
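
For reference, these file:line os_waits=N lines match the output format of SHOW ENGINE INNODB MUTEX in 5.6, so the same counters can be re-captured directly while the contention is happening. The file names hint at the owning subsystem (trx0rseg.cc is the rollback-segment code, dict0dict.cc the data dictionary/index code):

    -- Dump InnoDB mutex/rw-lock OS-wait counters; the Name column is the
    -- source file and line where each latch was created.
    SHOW ENGINE INNODB MUTEX;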

Some other information:
1. Query Cache:

| have_query_cache             | YES                               |
| long_query_time              | 0.500000                          |
| query_alloc_block_size       | 8192                              |
| query_cache_limit            | 10485                             |
| query_cache_min_res_unit     | 4096                              |
| query_cache_size             | 0                                 |
| query_cache_type             | OFF                               |
| query_cache_wlock_invalidate | OFF                               |
| query_prealloc_size          | 8192                              |

2. Adaptive Hash Index is also ON.
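
For completeness, both of these settings can be confirmed at runtime; these are standard SHOW statements, nothing assumed beyond 5.6:

    SHOW GLOBAL VARIABLES LIKE 'query_cache_type';
    SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index';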

What could the issue be? Please let me know if you need more details.

A vmstat snapshot from that time:

procs   -----------memory----------   ---swap--  -----io----  --system--  -----cpu-----
 r  b   swpd   free    buff   cache    si   so    bi    bo    in     cs us sy id wa st
17  0 685588 2254064 310852 23913820    0    0    96   404     0      0  3  1 97  0  0     
 5  0 685588 2256376 310856 23915576    0    0  6816 14092 53915 142794 18  3 79  0  0  
 7  0 685588 2256284 310860 23916752    0    0  6288 13136 51433 143163 15  2 82  0  0  
 7  0 685588 2254044 310860 23917792    0    0  7904 18572 52465 142161 16  2 82  0  0  
 8  0 685588 2254044 310860 23918800    0    0  6512 18244 51958 146916 15  2 82  0  0  
 6  0 685584 2252828 310860 23919844    4    0  5444 16820 51750 147625 16  2 82  0  0  
 5  0 685584 2251820 310860 23920884    0    0  6864 14084 53609 146373 15  2 82  0  0  
 6  0 685584 2251952 310860 23921996    0    0  6016 13112 52862 147601 15  2 82  0  0  
10  0 685584 2250164 310864 23923056    0    0  6096 14468 52406 147109 15  2 82  0  0  
 7  0 685584 2248936 310864 23924112    0    0  6576 13664 51638 142051 15  2 83  0  0

Best Answer

Query cache off -- good.

Large history list -- InnoDB stumbling over itself.

Latency suddenly shot up -- ditto.

Thread count suddenly shot up -- ditto.

Possible causes:

  • A long-running UPDATE or ALTER blocking lots of other threads, or even a poorly written SELECT. If the slow log was on, you may be able to discover what it was: long_query_time = 0.5 means it will have caught the naughty query, though there will also be a lot of normally fast queries in the log (see the sketch after this list).
  • Too many threads. What is "too many"? However many it takes for this to happen. How do you control it? Decrease the limit on the number of connections, either in the clients or with max_connections. That sounds like a bad remedy, but once MySQL gets into the state you described, it may take a restart to uncork it: sharing resources among 300-400 threads means that no one will 'ever' get finished. How many clients? Web server(s)? What is the limit on each? Example: Apache's MaxClients.
  • Swapping. Do you have any monitoring that would show whether there was swapping? 90G/128G sounds fine, but perhaps something else was eating into RAM. Swapping is terrible for MySQL.
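
A minimal sketch of those checks, using standard MySQL 5.6 variable and status names (the max_connections value is purely illustrative):

    -- Was the slow log on, and where does it write?
    SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';

    -- While the pile-up is happening: catch any long-running transaction.
    SELECT trx_id, trx_started, trx_rows_locked, trx_query
    FROM information_schema.INNODB_TRX
    ORDER BY trx_started
    LIMIT 5;

    -- Current depth of the pile-up (the 300-400 figure comes from here).
    SHOW GLOBAL STATUS LIKE 'Threads_running';

    -- Illustrative cap only; size it to your actual client fleet.
    SET GLOBAL max_connections = 200;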