Our primary authentication database, used by 100k+ users and growing, is a VM on a shared machine that we want to move onto its own isolated hardware. This will let us isolate resources and help bring clarity to optimizations. However, I'm not clear on what an ideal machine for our workload would be, or how to determine that, so I turn to you for some advice.
- We are using MariaDB 10.1.26 (formerly MySQL) on a Debian Stretch 64-bit VM with 10GB of memory and 6 cores allocated to it. The CPU is mostly idle.
- There are approximately 320 tables across three databases; no table is very large, each under 100MB.
- Size of data on disk is 24GB, with ibdata1 being 2.2GB. Data in InnoDB tables: 1019.8M (Tables: 37).
- Reads / Writes: 98% / 2%
- Total buffers: 1.9G global + 2.9M per thread (2500 max threads)
- Maximum reached memory usage: 4.6G (47.52% of installed RAM)
This machine is replicated to a secondary (and then from the secondary to a tertiary), so relay logs take up a good portion of that space. This server is used for account authentication; it typically does SELECTs and reads on InnoDB rows, so it holds very little data, but it needs to be functioning and fast.
On average there are 230 connected threads, but there are periods where spikes of connections cause us to reach max_connections. We have been increasing that as we go along and tuning various parameters, but we are maxed out on the memory we can allocate. There are also situations where some queries take longer than they should; this may be a network issue, or happen when we are reaching max_connections. Nothing shows up in the slow query log, as they are typical queries that are usually quite fast but sometimes take longer than they should.
- We have zero slow queries (out of 28M queries)
- extremely few aborted connections
- no sorts requiring temporary tables
- and no joins without indexes.
We do have temporary tables created on disk: 42% (1K on disk / 3K total), which is a bit high, and I'm unsure why that is happening.
- Our thread cache hit rate: 99% (735 created / 913K connections)
- table cache hit rate: 97% (253 open / 259 opened).
- open file limit used: 0% (50/16K)
- 99% of our table locks are acquired immediately (out of 29M).
We are using InnoDB for all tables. Some InnoDB stats:
- InnoDB Thread Concurrency: 0
- InnoDB File per table is activated
- InnoDB buffer pool / data size: 1.2G/1019.8M
- InnoDB buffer pool instances: 1
- InnoDB Read buffer efficiency: 99.98% (240354196 hits/ 240398682 total)
- InnoDB Write Log efficiency: 75.01% (302451 hits/ 403237 total)
- InnoDB log waits: 0.00% (0 waits / 100786 writes)
I've got munin graphs for the last year of the following, which I'm happy to provide on request (there are too many of them to individually provide):
- binary/relay log usage
- different commands/second
- threads and connection counts
- table cache, open files, open tables
- handler activity (write, update, delete, etc)
- InnoDB Buffer Pool size, pages, and modified pages
- InnoDB Buffer Pool activity (pages read, created, and written)
- InnoDB Checkpoint Age
- InnoDB History List length
- number and size of InnoDB Insert Buffers
- InnoDB IO (file reads/writes, log writes, and file syncs)
- InnoDB IO Pending
- InnoDB Log buffer size, KB flushed and written
- InnoDB Row Operations
- InnoDB Semaphores
- InnoDB Transactions
I also have types of sorts, number of table locks, temporary disk tables, and number of threads.
- tps = 3.342
- qps = 270.123
based on the following:
use information_schema;
select VARIABLE_VALUE into @num_queries from GLOBAL_STATUS where VARIABLE_NAME = 'QUESTIONS';
select VARIABLE_VALUE into @uptime from GLOBAL_STATUS where VARIABLE_NAME = 'UPTIME';
select VARIABLE_VALUE into @num_com from GLOBAL_STATUS where VARIABLE_NAME = 'COM_COMMIT';
select VARIABLE_VALUE into @num_roll from GLOBAL_STATUS where VARIABLE_NAME = 'COM_ROLLBACK';
select (@num_com + @num_roll) / @uptime as tps, @num_queries / @uptime as qps;
- SHOW VARIABLES
- SHOW GLOBAL VARIABLES
- SHOW GLOBAL STATUS
- iostat -x output
- ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 39926
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 39926
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
- df -h:
Filesystem Size Used Avail Use% Mounted on
udev 4.9G 0 4.9G 0% /dev
tmpfs 1001M 452K 1000M 1% /run
/dev/vda1 45G 28G 15G 65% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 4.9G 0 4.9G 0% /sys/fs/cgroup
tmpfs 1001M 0 1001M 0% /run/user/0
- free -h:
total used free shared buff/cache available
Mem: 9.8G 2.8G 4.1G 452K 2.8G 6.7G
Swap: 0B 0B 0B
Best Answer
Most tables are MyISAM? That may be holding you back when it comes to performance. Conflicting info: "320 tables"; "Data in InnoDB tables: 1019.8M (Tables: 37)."; "We are using InnoDB for all tables"
"Maximum reached memory usage: 4.6G", "10G of RAM" -- Do you have other applications running in the same VM? If not you may be under-utilizing RAM. Or maybe not: "InnoDB buffer pool / data size: 1.2G/1019.8M"
"secondary to a tertiary" -- Why? There are valid reasons for such, but if the secondary dies, it leaves the tertiary useless.
"spikes of connections cause us to reach max_connections" -- This could be a serious problem. Is it caused by table locks in MyISAM? Slow queries? DDL operations (ALTER, etc)? There are cases where throttling the connections at the client (eg webserver) is better than increasing max_connections. I need to get a better feel for what is going on. During a spike, does Threads_running go up? When that STATUS is more than, say, 20, MySQL is likely stumbling over itself. Latency suffers, and throughput stagnates, or even declines.
"We have zero slow queries" -- I'll bet long_query_time is still at the useless default of 10 seconds. Change it to 1.
"temporary tables created on disk: 42%" -- bad. But let's approach it by finding the 'worst' queries via the slowlog.
Please follow the slowlog suggestions in http://mysql.rjweb.org/doc.php/mysql_analysis
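As a concrete starting point, something like the following turns the slowlog on at runtime. This is a sketch: the log file path is an example, not your actual path, and log_slow_verbosity is MariaDB-specific; mirror whatever values you settle on in my.cnf so they survive a restart.

```sql
-- Enable the slow query log at runtime (requires SUPER privilege).
SET GLOBAL slow_query_log = ON;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- example path
SET GLOBAL long_query_time = 1;  -- seconds; the default of 10 hides almost everything
SET GLOBAL log_slow_verbosity = 'query_plan,explain';  -- MariaDB-specific extra detail
```

After a day or so of traffic (including at least one connection spike), summarize the log with pt-query-digest or mysqldumpslow as described at the link above.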
Graphs: Most will be boring, but spikes might be interesting; especially lining up the graphs that show abnormal activity at the same time as your "spike of connections". (It might suffice to list the metrics and not actually provide the graphs.)
270 qps & 3 tps -- modest. (One survey shows 100 qps to be about the median, and 1900 to be the 90th percentile.)
You apparently have few tables since you have not slammed into open_file_limit = 1024.
"You can't tune your way out of a performance problem" (and most of the tunables are OK). So, I think the Slowlog (see link above) is the best next step.
Analysis of GLOBAL STATUS and VARIABLES
Observations:
The More Important Issues:
The values here say you are running mostly InnoDB.
Innodb rows deleted is amazingly close to rows inserted -- what kind of processing are you doing?
Spurious COMMITs? (Com_insert + Com_update + Com_delete + Com_replace) < Com_commit
Change slowlog parameters.
20 admin commands per second -- What is going on?
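To get that better feel during a spike, a quick sample of the standard status counters might look like this (a sketch; the sustained-above-20 threshold for Threads_running is the rule of thumb mentioned above, not a hard limit):

```sql
-- Sample during a connection spike; sustained Threads_running above ~20
-- suggests MySQL is stumbling over itself rather than just being busy.
SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
-- Compare write statements to commits to spot spurious COMMITs:
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Com_commit', 'Com_rollback', 'Com_insert',
   'Com_update', 'Com_delete', 'Com_replace');
```

Running this every few seconds during a spike and diffing the counters will show whether the spike is genuine throughput or pile-up.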
Details and other observations:
- ( (key_buffer_size - 1.2 * Key_blocks_used * 1024) / _ram ) = (512M - 1.2 * 1449 * 1024) / 10240M = 5.0% -- Percent of RAM wasted in key_buffer. -- Decrease key_buffer_size.
- ( Key_blocks_used * 1024 / key_buffer_size ) = 1,449 * 1024 / 512M = 0.28% -- Percent of key_buffer used (high-water mark). -- Lower key_buffer_size to avoid unnecessary memory usage.
- ( innodb_buffer_pool_size / _ram ) = 1250M / 10240M = 12.2% -- % of RAM used for InnoDB buffer_pool.
- ( (key_buffer_size / 0.20 + innodb_buffer_pool_size / 0.70) / _ram ) = (512M / 0.20 + 1250M / 0.70) / 10240M = 42.4% -- Most of available RAM should be made available for caching. -- http://mysql.rjweb.org/doc.php/memory
- ( innodb_buffer_pool_size ) = 1250M -- InnoDB Data + Index cache -- 128M (an old default) is woefully small.
- ( Innodb_buffer_pool_pages_free / Innodb_buffer_pool_pages_total ) = 21,062 / 79999 = 26.3% -- Pct of buffer_pool currently not in use -- innodb_buffer_pool_size is bigger than necessary?
- ( Innodb_pages_written / Innodb_buffer_pool_write_requests ) = 855,522 / 3597243 = 23.8% -- Write requests that had to hit disk -- Check innodb_buffer_pool_size.
- ( Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group / innodb_log_file_size ) = 252,449,792 / (107581 / 3600) / 2 / 156M = 0.0258 -- Ratio -- (see minutes, next item)
- ( Uptime / 60 * innodb_log_file_size / Innodb_os_log_written ) = 107,581 / 60 * 156M / 252449792 = 1,161 -- Minutes between InnoDB log rotations. Beginning with 5.6.8, this can be changed dynamically; be sure to also change my.cnf. -- (The recommendation of 60 minutes between rotations is somewhat arbitrary.) Adjust innodb_log_file_size. (Cannot change in AWS.)
- ( default_tmp_storage_engine ) = default_tmp_storage_engine =
- ( Innodb_rows_deleted / Innodb_rows_inserted ) = 50,443 / 50359 = 1 -- Churn -- "Don't queue it, just do it." (If MySQL is being used as a queue.)
- ( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: if you have lots of deadlocks, this may write a lot to disk.
- ( max_connections ) = 2,500 -- Maximum number of connections (threads). Impacts various allocations. -- If max_connections is too high and various memory settings are high, you could run out of RAM.
- ( innodb_buffer_pool_populate ) = OFF = 0 -- NUMA control
- ( query_alloc_block_size / _ram ) = 16,384 / 10240M = 0.00% -- For parsing. Pct of RAM.
- ( local_infile ) = local_infile = ON -- local_infile = ON is a potential security issue.
- ( Created_tmp_disk_tables / Created_tmp_tables ) = 1,400 / 3347 = 41.8% -- Percent of temp tables that spilled to disk -- Maybe increase tmp_table_size and max_heap_table_size; improve indexes; avoid blobs, etc.
- ( (Com_insert + Com_update + Com_delete + Com_replace) / Com_commit ) = (54024 + 289659 + 6707 + 600) / 360402 = 0.974 -- Statements per Commit (assuming all InnoDB) -- Low: might help to group queries together in transactions; High: long transactions strain various things.
- ( binlog_format ) = binlog_format = STATEMENT -- STATEMENT/ROW/MIXED. ROW is preferred; it may become the default.
- ( expire_logs_days ) = 0 -- How soon to automatically purge binlog (after this many days). -- Too large (or zero) = consumes disk space; too small = need to respond quickly to network/machine crash. (Not relevant if log_bin = OFF.)
- ( slow_query_log ) = slow_query_log = OFF -- Whether to log slow queries. (5.1.12)
- ( long_query_time ) = 5 -- Cutoff (seconds) for defining a "slow" query. -- Suggest 2.
- ( Connections ) = 922,332 / 107581 = 8.6 /sec -- Connections -- Increase wait_timeout; use pooling?
- ( thread_cache_size ) = 256 -- How many extra processes to keep around. (Not relevant when using thread pooling.) (Autosized as of 5.6.8, based on max_connections.) 0 is inefficient for non-Windows; 10 is probably fine; more than 100 may lead to OOM.
Abnormally small:
Abnormally large:
Abnormal strings:
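If the slowlog doesn't explain the 42% temp-table spill, raising the in-memory limits is a cheap experiment. A sketch, with 64M as a guess to validate against your graphs rather than a recommendation; note that TEXT/BLOB columns force on-disk temp tables regardless of these settings:

```sql
-- Both must be raised together; the effective in-memory limit
-- is the smaller of the two values.
SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;
-- Persist in my.cnf as well; SET GLOBAL only affects new connections.
```

Watch Created_tmp_disk_tables / Created_tmp_tables afterward; if the ratio barely moves, the spill is likely blob-driven or query-shape-driven, not a size problem.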