MySQL – Large number of MySQL running threads killing response times

Tags: innodb, mysql, mysql-5.6

Yesterday, our MySQL server's response time increased about 20 times: SELECT queries that earlier took 0.6-0.7 ms started taking 20 ms, and UPDATEs even more, around 60 ms.

The MySQL running-threads count, which usually stays around 10-15, shot up to 300-400 and stayed there for a good five hours, before coming down when traffic cooled.

Server version: 5.6.25-log MySQL Community Server (GPL), on a 32-core machine with 128 GB of RAM (90 GB InnoDB buffer pool).

The storage engine is InnoDB.

Some output from SHOW ENGINE INNODB STATUS, captured by pt-stalk:

----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 159291194
OS WAIT ARRAY INFO: signal count 1074930257
Mutex spin waits 1220535470, rounds 2921695607, OS waits 26779994
RW-shared spins 291539762, rounds 1856674850, OS waits 31197621
RW-excl spins 300274870, rounds 5700270052, OS waits 95339247
Spin rounds per wait: 2.39 mutex, 6.37 RW-shared, 18.98 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 20746118705
Purge done for trx's n:o < 20746118666 undo n:o < 5 state: running but idle
History list length 1154
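
If this recurs, the history list length can be tracked over time without parsing SHOW ENGINE INNODB STATUS. A minimal sketch, assuming MySQL 5.6's information_schema.INNODB_METRICS table (the trx_rseg_history_len counter should be enabled there by default):

    -- A steadily growing COUNT means purge cannot keep up, often because
    -- a long-running transaction is pinning old undo records.
    SELECT NAME, COUNT, MAX_COUNT
    FROM information_schema.INNODB_METRICS
    WHERE NAME = 'trx_rseg_history_len';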

Also, there are a large number of OS waits, specifically at:

InnoDB  dict0dict.cc:2606  os_waits=60628
InnoDB  dict0dict.cc:2606  os_waits=131984
InnoDB  dict0dict.cc:2606  os_waits=81342
InnoDB  dict0dict.cc:2606  os_waits=230958

Also at:

InnoDB  trx0rseg.cc:196  os_waits=491
InnoDB  trx0rseg.cc:196  os_waits=458
InnoDB  trx0rseg.cc:196  os_waits=489
InnoDB  trx0rseg.cc:196  os_waits=483
InnoDB  trx0rseg.cc:196  os_waits=459
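
For reference, these file:line os_waits=N lines match the output format of SHOW ENGINE INNODB MUTEX in 5.6, so the same counters can be re-captured directly while the contention is happening. The file names hint at the owning subsystem (trx0rseg.cc is the rollback-segment code, dict0dict.cc the data dictionary/index code):

    -- Dump InnoDB mutex/rw-lock OS-wait counters; the Name column is the
    -- source file and line where each latch was created.
    SHOW ENGINE INNODB MUTEX;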

Some other information:
1. Query Cache:

| have_query_cache             | YES                               |
| long_query_time              | 0.500000                          |
| query_alloc_block_size       | 8192                              |
| query_cache_limit            | 10485                             |
| query_cache_min_res_unit     | 4096                              |
| query_cache_size             | 0                                 |
| query_cache_type             | OFF                               |
| query_cache_wlock_invalidate | OFF                               |
| query_prealloc_size          | 8192                              |

2. Adaptive Hash Index is also ON.
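
For completeness, both of these settings can be confirmed at runtime; these are standard SHOW statements, nothing assumed beyond 5.6:

    SHOW GLOBAL VARIABLES LIKE 'query_cache_type';
    SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index';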

What could the issue be? Please let me know if you need more details.

A vmstat snapshot from that time:

procs   -----------memory----------   ---swap--  -----io----  --system--  -----cpu-----
 r  b   swpd   free    buff   cache    si   so    bi    bo    in     cs us sy id wa st
17  0 685588 2254064 310852 23913820    0    0    96   404     0      0  3  1 97  0  0     
 5  0 685588 2256376 310856 23915576    0    0  6816 14092 53915 142794 18  3 79  0  0  
 7  0 685588 2256284 310860 23916752    0    0  6288 13136 51433 143163 15  2 82  0  0  
 7  0 685588 2254044 310860 23917792    0    0  7904 18572 52465 142161 16  2 82  0  0  
 8  0 685588 2254044 310860 23918800    0    0  6512 18244 51958 146916 15  2 82  0  0  
 6  0 685584 2252828 310860 23919844    4    0  5444 16820 51750 147625 16  2 82  0  0  
 5  0 685584 2251820 310860 23920884    0    0  6864 14084 53609 146373 15  2 82  0  0  
 6  0 685584 2251952 310860 23921996    0    0  6016 13112 52862 147601 15  2 82  0  0  
10  0 685584 2250164 310864 23923056    0    0  6096 14468 52406 147109 15  2 82  0  0  
 7  0 685584 2248936 310864 23924112    0    0  6576 13664 51638 142051 15  2 83  0  0

Best Answer

Query cache off -- good.

Large history list -- InnoDB stumbling over itself.

Latency suddenly shot up -- ditto.

Thread count suddenly shot up -- ditto.

Possible causes:

  • A long-running UPDATE or ALTER blocking lots of other threads, or even a poorly written SELECT. If the slow log was on, you may be able to discover what it was: long_query_time = 0.5 means it will have caught the naughty query, though there will also be a lot of normally fast queries in the log (see the sketch after this list).
  • Too many threads. What is "too many"? However many it takes for this to happen. How do you control it? Decrease the limit on the number of connections, either in the clients or with max_connections. That sounds like a bad remedy, but once MySQL gets into the state you described, it may take a restart to uncork it: sharing resources among 300-400 threads means that no one will 'ever' get finished. How many clients? Web server(s)? What is the limit on each? Example: Apache's MaxClients.
  • Swapping. Do you have any monitoring that would show whether there was swapping? 90G/128G sounds fine, but perhaps something else was eating into RAM. Swapping is terrible for MySQL.
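
A minimal sketch of those checks, using standard MySQL 5.6 variable and status names (the max_connections value is purely illustrative):

    -- Was the slow log on, and where does it write?
    SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';

    -- While the pile-up is happening: catch any long-running transaction.
    SELECT trx_id, trx_started, trx_rows_locked, trx_query
    FROM information_schema.INNODB_TRX
    ORDER BY trx_started
    LIMIT 5;

    -- Current depth of the pile-up (the 300-400 figure comes from here).
    SHOW GLOBAL STATUS LIKE 'Threads_running';

    -- Illustrative cap only; size it to your actual client fleet.
    SET GLOBAL max_connections = 200;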