MySQL Performance – Troubleshooting Sudden Usage Spikes

innodb, mysql, mysql-5.6

We have a site with a MySQL DB that has been running properly for more than 2 years.

However, 2 weeks ago we noticed a sudden spike in load. We restarted the MySQL service and the load normalized. We assumed the issue was a one-off and forgot about it.

But for the last week we have been seeing the MySQL load spike almost daily, sometimes just a few minutes after a restart. It has rendered the site almost unusable.

A check of the log does not show any query that would cause a spike. The majority of queries (about 90%) take less than 10 ms, a few (about 9%) take between 10 and 20 ms, and the rest, which take between 20 and 30 ms, are negligible at about 1%. Apart from 2 long-running queries that run when the app starts, I haven't seen any regular query that takes more than 30 ms.

$ uname -a
Linux 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

$ mysql --version
mysql  Ver 14.14 Distrib 5.6.27, for Linux (x86_64) using  EditLine wrapper            

Any reason why MySQL would suddenly become unstable?

If, for some reason(s), we have simply hit a limitation of MySQL (probably an InnoDB issue), would it help to move to MariaDB or Percona Server (which uses XtraDB)?

Edit 1: Additional information

Something I have noticed is that, at the times when performance degrades, there are many SELECT queries. None of them is particularly resource-intensive (the longest takes about 4 ms), but they occur in bursts, and this could be the problem.

The 'burst' queries look valid, and not like the result of a DoS attack or XSS.

Edit 2: MySQL conf

$ cat /etc/my.cnf
[mysqld]
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under different user or group, 
# customize your systemd unit file for mysqld
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
tmpdir=/var/mysqltmp

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

open_files_limit=50000
query_cache_size=2048M
max_connections=100
max_user_connections=25
wait_timeout=20
tmp_table_size=6144M
max_heap_table_size=6144M
thread_cache_size=64
key_buffer_size=8192M
max_allowed_packet=268435456
#table_cache=32768
table_open_cache=32768
table_definition_cache=32768

delayed_insert_timeout=20 # Turn on if max_connections being reached due to delayed inserts
delayed_queue_size=300 # Turn on if max_connections being reached due to delayed inserts

myisam_sort_buffer_size=512M # can be increased per sessions if needed for alter tables (indexes, repair)

query_cache_limit=32M # leave at default unless there is a good reason
query_cache_type=1
join_buffer=32M # leave at default unless there is a good reason
sort_buffer_size=32M # leave at default unless there is a good reason
read_rnd_buffer_size=16M # leave at default unless there is a good reason
read_buffer_size=32M # leave at default unless there is a good reason

collation_server=utf8_unicode_ci
character_set_server=utf8

general_log=0
log-output=TABLE # select * from mysql.general_log order by event_time desc limit 10;

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
innodb_file_per_table=1
innodb_buffer_pool_size=34406M
innodb_additional_mem_pool_size=7424M
innodb_log_buffer_size=7424M
innodb_thread_concurrency=8 # Number of physical + virtual CPU's, preset when server is provisioned to have correct # of cores
innodb_change_buffering=all
innodb_flush_log_at_trx_commit=0
innodb_support_xa=0
innodb_doublewrite = 0

default-storage-engine=MyISAM


[mysqld_safe]
#malloc-lib=/usr/lib64/libtcmalloc_minimal.so.4
malloc-lib=/usr/lib64/libtcmalloc_minimal.so.4
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

my.cnf shows default-storage-engine=MyISAM but all tables I have created are InnoDB.
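
A quick way to confirm what the schema actually uses is to count tables per engine in information_schema (a generic sketch, nothing here is specific to our setup):

SELECT ENGINE,
       COUNT(*) AS tables,
       ROUND(SUM(DATA_LENGTH + INDEX_LENGTH)/1024/1024) AS size_mb
FROM   information_schema.TABLES
WHERE  TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema')
GROUP  BY ENGINE;

If all application tables really are InnoDB, this should return a single InnoDB row (views show up with a NULL engine).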

Environment

How much RAM? 48GB

Using InnoDB? Yes

How many cores? 8

Best Answer

This is a common issue once you scale past a certain point. There could be many causes, so let's first search for the cause, then the solution.

  • Set long_query_time=0.
  • Turn on the slowlog.
  • Run for a few hours.
  • Turn off the slowlog (or raise the cutoff to 1).
  • Run pt-query-digest to find the 'worst' couple of queries.
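
A minimal sketch of that capture, using dynamic variables so the server does not need yet another restart (the file path is only an example, and note that the my.cnf above sets log-output=TABLE, so switch to FILE first if you want something pt-query-digest can read):

SET GLOBAL log_output = 'FILE';                                -- my.cnf above uses TABLE
SET GLOBAL slow_query_log_file = '/var/mysqltmp/slow-all.log'; -- example path
SET GLOBAL long_query_time = 0;                                -- capture everything
SET GLOBAL slow_query_log = ON;
-- ...let it run for a few hours under normal load, then:
SET GLOBAL slow_query_log = OFF;   -- or raise long_query_time back to 1
SET GLOBAL log_output = 'TABLE';

Afterwards, feed the file to pt-query-digest (Percona Toolkit) and concentrate on the top few entries by total time.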

The worst offenders may be zillions of 10 ms queries, or they may be a few 1-minute queries. In either case, look carefully at whether

  • you need them that often,
  • they have optimal indexes,
  • they can be combined in some more efficient way
  • they can be rewritten
  • etc.
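
For each of the top offenders from the digest, EXPLAIN is the quickest way to check the index question (a sketch only; the orders table and its columns are made up, so substitute a real query):

EXPLAIN SELECT *
FROM   orders                      -- hypothetical table
WHERE  customer_id = 42
ORDER  BY created_at DESC
LIMIT  20;
-- Red flags: type=ALL (full scan), a large 'rows' estimate,
-- and 'Using temporary; Using filesort' in the Extra column.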

Check some tunables

  • How much RAM do you have?
  • Are you using InnoDB?
  • What is the value of innodb_buffer_pool_size?
  • If you provide me with SHOW GLOBAL STATUS; and SHOW VARIABLES;, I will search for other issues.
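
Those snapshots are cheap to collect (a sketch; redirect the output to files from the mysql client or the shell, since the full dumps are long):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL STATUS;      -- ideally after several hours of uptime
SHOW GLOBAL VARIABLES;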

Other likely causes

  • UUIDs are terrible as keys; as soon as the data exceeds cache, you are in deep doo-doo.
  • EAV schema has problems
  • How many cores? What are max_connections and Max_used_connections? What version? (You could have the "thundering herd" or "too many people in the grocery store" syndrome.)
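
A quick check for the "thundering herd" possibility is to compare the configured ceiling with the high-water mark and the current concurrency (sketch):

SHOW VARIABLES LIKE 'max_connections';            -- configured ceiling (100 in this my.cnf)
SHOW GLOBAL STATUS LIKE 'Max_used_connections';   -- high-water mark since startup
SHOW GLOBAL STATUS LIKE 'Threads_running';        -- how many are actually executing right now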

Tuning Analysis

Observations:

Version: 5.6.27
48 GB of RAM
Uptime = 03:06:22; Please rerun SHOW GLOBAL STATUS after several hours.
You are not running on Windows.
Running 64-bit version
You appear to be running entirely (or mostly) InnoDB.

The More Important Issues

Since you are not(?) using MyISAM, decrease key_buffer_size to only 10M.

innodb_log_buffer_size is over 7G, 15% of RAM -- far too big; decrease to 100M.

innodb_additional_mem_pool_size is no longer used; remove it from my.cnf

It is terribly dangerous to set tmp_table_size and max_heap_table_size too high. Recommend 1% of RAM size, not 12%.

query_cache_type = ON and query_cache_size = 2G -- This may be the worst performance hit. Use no more than 100M for the size. When it is 2G, each write spends a lot of time going through the Query Cache to purge entries for the modified table. Furthermore, the metrics on the QC say it is not very effective, so even consider turning it OFF.
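
Put together, the changes above would look roughly like this in my.cnf (a sketch for this 48 GB, mostly-InnoDB box; the exact values are judgment calls, not measurements):

key_buffer_size=10M                # little or no MyISAM data
innodb_log_buffer_size=100M        # was 7424M
# innodb_additional_mem_pool_size removed entirely -- no longer used
tmp_table_size=512M                # roughly 1% of RAM (see also the 64M suggestion below)
max_heap_table_size=512M
query_cache_type=0                 # or keep it ON with query_cache_size no larger than 100M
query_cache_size=0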

Several STATUS metrics indicate lack of indexes and/or poorly written queries. Let's take a look at a couple of them.

innodb_doublewrite = OFF -- this may (rarely) lead to "torn pages" after a crash.

There seems to be no MyISAM activity; this is unexpected because some system tables are MyISAM. What gives? Also puzzling: default-storage-engine=MyISAM.

Details and other observations

Memory allocation

( (key_buffer_size - 1.2 * Key_blocks_used * 1024) / _ram ) = (8589934592 - 1.2 * 0 * 1024) / 49152M = 16.7% -- Percent of RAM wasted in key_buffer. -- Decrease key_buffer_size.

( Key_blocks_used * 1024 / key_buffer_size ) = 0 * 1024 / 8589934592 = 0 -- Percent of key_buffer used. High-water-mark. -- Lower key_buffer_size to avoid unnecessary memory usage.

( open_files_limit ) = 1,024 -- ulimit -n -- To allow more files, change ulimit or /etc/security/limits.conf or in sysctl.conf (kern.maxfiles & kern.maxfilesperproc) or something else (OS dependent)

( innodb_log_buffer_size ) = 7G -- far too big; decrease to 100M

( join_buffer_size ) = 32M -- 0-N per thread. May speed up JOINs (better to fix queries/indexes) (all engines) Used for index scan, range index scan, full table scan, each full JOIN, etc. -- Use the default.

( min( tmp_table_size, max_heap_table_size ) / _ram ) = min( 6442450944, 6442450944 ) / 49152M = 12.5% -- Percent of RAM to allocate when needing MEMORY table (per table), or temp table inside a SELECT (per temp table per some SELECTs). Too high may lead to swapping. -- Decrease tmp_table_size and max_heap_table_size to, say, 1% of ram.

( Created_tmp_disk_tables / (Created_tmp_disk_tables + Created_tmp_tables) ) = 5,245 / (5245 + 19126) = 21.5% -- Percent of temp tables that spilled to disk -- maybe increase tmp_table_size and max_heap_table_size; avoid blobs, etc.

( tmp_table_size ) = 6,442,450,944 = 6144MB -- Limit on size of MEMORY temp tables used to support a SELECT -- Decrease tmp_table_size to avoid running out of RAM. Perhaps no more than 64M.

( Handler_read_rnd_next ) = 23,608,688,071 / 11182 = 2111311 /sec -- High if lots of table scans -- possibly inadequate keys

( Handler_read_rnd_next / Com_select ) = 23,608,688,071 / 116779 = 202,165 -- Avg rows scanned per SELECT. (approx) -- Consider raising read_buffer_size

( Select_scan ) = 33,170 / 11182 = 3 /sec -- full table scans -- Add indexes / optimize queries (unless they are tiny tables)

( Select_scan / Com_select ) = 33,170 / 116779 = 28.4% -- % of selects doing full table scan. (May be fooled by Stored Routines.) -- Add indexes / optimize queries
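
One way to see which statements are behind those scans is to also log index-less queries while the slow log is on (a sketch; this can be noisy on a busy server, hence the throttle and the row limit):

SET GLOBAL log_queries_not_using_indexes = ON;
SET GLOBAL log_throttle_queries_not_using_indexes = 10;  -- 5.6.5+: at most 10 such entries per minute
SET GLOBAL min_examined_row_limit = 1000;                -- ignore scans of tiny tables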

( long_query_time ) = 10.000000 = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2

( back_log / max_connections ) = 70 / 100 = 70.0%

( Connections ) = 28,080 / 11182 = 2.5 /sec -- Connections -- Increase wait_timeout; use pooling?

( thread_cache_size ) = 64 -- How many extra processes to keep around (Not relevant when using thread pooling) (Autosized as of 5.6.8; based on max_connections) -- 0 is good for Windows; 0 is inefficient for non-Windows; 10 is probably fine

( thread_cache_size / max_connections ) = 64 / 100 = 64.0% -- (0 for Windows)

( Com_show_collations ) = 30/minute. What's up?