Mysql – How to spec a machine for a particular database load

hardwaremariadbMySQLoptimizationperformance

Our primary authentication database that is used for 100k+ users, and growing is a VM on a shared machine that we want to extract into its own isolated hardware. This will let us isolate resources, and help bring clarity to optimizations. However, I'm not clear on what would be an ideal machine for what we are doing, or how to determine that. So I turn to you for some advice.

  • We are using MariaDB 10.1.26 (formerly mysql) on a Debian Stretch 64-bit VM with 10GB of memory, and 6 cores allocated to it. The CPU is mostly idle.
  • There are approximately 320 tables across three databases, no table is very large, each under 100mb.
  • Size of data on disk is 24GB, with ibdata1 being 2.2GB. Data in InnoDB tables: 1019.8M (Tables: 37).
  • Reads / Writes: 98% / 2%
  • Total buffers: 1.9G global + 2.9M per thread (2500 max threads)
  • Maximum reached memory usage: 4.6G (47.52% of installed RAM)

This machine is replicated to a secondary (and then from the secondary to a tertiary), so relay logs take up a bunch of that space. This server is used for account authentication, it typically is doing SELECTs, and Reads on InnoDB rows so it has very little data, but it needs to be functioning and fast. On
average, there are 230 connected threads, but there are periods where spikes of connections cause us to reach max_connections, and we have been increasing that as we go along, and have been tuning various parameters, but we are maxed out on the available memory we can allocate. There are also situations where some queries are taking longer than they should, this may be a network issue, or when we are reaching max connections, there is nothing that shows up in the slow query log, as they are typical queries that are usually quite fast, but sometimes take longer than they should.

  • We have zero slow queries (out of 28M queries)
  • extremely few aborted connections
  • no sorts requiring temporary tables
  • and no joins without indexes.

We do have temporary tables created on disk: 42% (1K on disk / 3K
total), which is a bit high, and I'm unsure why that is
happening.

  • Our thread cache hit rate: 99% (735 created / 913K connections)
  • table cache hit rate: 97% (253 open / 259 opened).
  • open file limit used: 0% (50/16K)
  • 99% of our table locks are acquired immediately (out of 29M).

We are using InnoDB for all tables. Some InnoDB stats:

  • InnoDB Thread Concurrency: 0
  • InnoDB File per table is activated
  • InnoDB buffer pool / data size: 1.2G/1019.8M
  • InnoDB buffer pool instances: 1
  • InnoDB Read buffer efficiency: 99.98% (240354196 hits/ 240398682 total)
  • InnoDB Write Log efficiency: 75.01% (302451 hits/ 403237 total)
  • InnoDB log waits: 0.00% (0 waits / 100786 writes)

I've got munin graphs for the last year of the following, which I'm happy to provide on request. There are too many of them to individually provide: binary/relay log usage; different commands/second; threads and connection counts; table cache, open files, open tables; handler activity (write, update, delete, etc); InnoDB Buffer Pool size, pages and modified pages; InnoDB Buffer Pool activity (pages read, created, and written); InnoDB Checkpoint Age; InnoDB History List length; Number and size InnoDB Insert Buffers; InnoDB IO (file reads/writes, log writes and file syncs); InnoDB IO Pending; InnoDB Log buffer size, KB flushed and Written; InnoDB Row Operations; InnoDB Semaphores; InnoDB Transactions. I also have types of sorts; number of table locks; temporary disk tables; and number of threads.

  • tps = 3.342
  • qps = 270.123

based on the following:

use information_schema;
select VARIABLE_VALUE into @num_queries from GLOBAL_STATUS where VARIABLE_NAME = 'QUESTIONS';
select VARIABLE_VALUE into @uptime from GLOBAL_STATUS where VARIABLE_NAME = 'UPTIME';
select VARIABLE_VALUE into @num_com from GLOBAL_STATUS where VARIABLE_NAME = 'COM_COMMIT';
select VARIABLE_VALUE into @num_roll from GLOBAL_STATUS where VARIABLE_NAME = 'COM_ROLLBACK';
select (@num_com + @num_roll) / @uptime as tps, @num_queries / @uptime as qps;
  • SHOW VARIABLES
  • SHOW GLOBAL VARIABLES
  • SHOW GLOBAL STATUS
  • iostat -x output
  • ulimit -a:

    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    scheduling priority (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 39926
    max locked memory (kbytes, -l) 64
    max memory size (kbytes, -m) unlimited
    open files (-n) 1024
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    real-time priority (-r) 0
    stack size (kbytes, -s) 8192
    cpu time (seconds, -t) unlimited
    max user processes (-u) 39926
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited
  • df -h:

    Filesystem Size Used Avail Use% Mounted on
    udev 4.9G 0 4.9G 0% /dev
    tmpfs 1001M 452K 1000M 1% /run
    /dev/vda1 45G 28G 15G 65% /
    tmpfs 4.9G 0 4.9G 0% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 4.9G 0 4.9G 0% /sys/fs/cgroup
    tmpfs 1001M 0 1001M 0% /run/user/0
  • free -h:

    total used free shared buff/cache available
    Mem: 9.8G 2.8G 4.1G 452K 2.8G 6.7G
    Swap: 0B 0B 0B

Best Answer

Most tables are MyISAM? That may be holding you back when it comes to performance. Conflicting info: "320 tables"; "Data in InnoDB tables: 1019.8M (Tables: 37)."; "We are using InnoDB for all tables"

"Maximum reached memory usage: 4.6G", "10G of RAM" -- Do you have other applications running in the same VM? If not you may be under-utilizing RAM. Or maybe not: "InnoDB buffer pool / data size: 1.2G/1019.8M"

"secondary to a tertiary" -- Why? There are valid reasons for such, but if the secondary dies, it leaves the tertiary useless.

"spikes of connections cause us to reach max_connections" -- This could be a serious problem. Is it caused by table locks in MyISAM? Slow queries? DDL operations (ALTER, etc)? There are cases where throttling the connections at the client (eg webserver) is better than increasing max_connections. I need to get a better feel for what is going on.

During a spike, does Threads_running go up? When that STATUS is more than, say, 20, MySQL is likely stumbling over itself. Latency suffers, and throughput stagnates, or even declines.

"We have zero slow queries". I'll bet long_query_time is still at the useless default of 10 seconds. Change it to 1.

"temporary tables created on disk: 42%" -- bad. But let's approach it by finding the 'worst' queries via the slowlog.

Please follow the slowlog suggestions in http://mysql.rjweb.org/doc.php/mysql_analysis

Graphs: Most will be boring. But spikes might be interesting. Especially lining up the graphs that show abnormal activity at the same time as your "spike of connections". (It might suffice to list the metrics an not actually provide the graphs.)

270 qps & 3 tps -- modest. (One survey show 100 qps to be about the median; 1900 to be the 90th percentile.)

You apparently have few tables since you have not slammed into open_file_limit = 1024.

"You can't tune your way out of a performance problem" (and most of the tunables are OK). So, I think the Slowlog (see link above) is the best next step.

Analysis of GLOBAL STATUS and VARIABLES

Observations:

  • Version: 10.1.26-MariaDB-0+deb9u1
  • 10 GB of RAM
  • Uptime = 1d 05:53:01
  • You are not running on Windows.
  • Running 64-bit version
  • You appear to be running entirely (or mostly) InnoDB.

The More Important Issues:

The values here say you are running mostly InnoDB.

key_buffer_size = 50M
innodb_buffer_pool_size = 4G

Innodb rows deleted is amazingly close to rows inserted -- what kind of processing are you doing?

Spurious COMMITs? (Com_insert + Com_update + Com_delete + Com_replace) < Com_commit

Change slowlog parameters.

20 admin commands per second -- What is going on?

Details and other observations:

( (key_buffer_size - 1.2 * Key_blocks_used * 1024) / _ram ) = (512M - 1.2 * 1449 * 1024) / 10240M = 5.0% -- Percent of RAM wasted in key_buffer. -- Decrease key_buffer_size.

( Key_blocks_used * 1024 / key_buffer_size ) = 1,449 * 1024 / 512M = 0.28% -- Percent of key_buffer used. High-water-mark. -- Lower key_buffer_size to avoid unnecessary memory usage.

( innodb_buffer_pool_size / _ram ) = 1250M / 10240M = 12.2% -- % of RAM used for InnoDB buffer_pool

( (key_buffer_size / 0.20 + innodb_buffer_pool_size / 0.70) / _ram ) = (512M / 0.20 + 1250M / 0.70) / 10240M = 42.4% -- Most of available ram should be made available for caching. -- http://mysql.rjweb.org/doc.php/memory

( innodb_buffer_pool_size ) = 1250M -- InnoDB Data + Index cache -- 128M (an old default) is woefully small.

( Innodb_buffer_pool_pages_free / Innodb_buffer_pool_pages_total ) = 21,062 / 79999 = 26.3% -- Pct of buffer_pool currently not in use -- innodb_buffer_pool_size is bigger than necessary?

( Innodb_pages_written / Innodb_buffer_pool_write_requests ) = 855,522 / 3597243 = 23.8% -- Write requests that had to hit disk -- Check innodb_buffer_pool_size

( Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group / innodb_log_file_size ) = 252,449,792 / (107581 / 3600) / 2 / 156M = 0.0258 -- Ratio -- (see minutes)

( Uptime / 60 * innodb_log_file_size / Innodb_os_log_written ) = 107,581 / 60 * 156M / 252449792 = 1,161 -- Minutes between InnoDB log rotations Beginning with 5.6.8, this can be changed dynamically; be sure to also change my.cnf. -- (The recommendation of 60 minutes between rotations is somewhat arbitrary.) Adjust innodb_log_file_size. (Cannot change in AWS.)

( default_tmp_storage_engine ) = default_tmp_storage_engine =

( Innodb_rows_deleted / Innodb_rows_inserted ) = 50,443 / 50359 = 1 -- Churn -- "Don't queue it, just do it." (If MySQL is being used as a queue.)

( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.

( max_connections ) = 2,500 -- Maximum number of connections (threads). Impacts various allocations. -- If max_connections is too high and various memory settings are high, you could run out of RAM.

( innodb_buffer_pool_populate ) = OFF = 0 -- NUMA control

( query_alloc_block_size / _ram ) = 16,384 / 10240M = 0.00% -- For parsing. Pct of RAM

( local_infile ) = local_infile = ON -- local_infile = ON is a potential security issue

( Created_tmp_disk_tables / Created_tmp_tables ) = 1,400 / 3347 = 41.8% -- Percent of temp tables that spilled to disk -- Maybe increase tmp_table_size and max_heap_table_size; improve indexes; avoid blobs, etc.

( (Com_insert + Com_update + Com_delete + Com_replace) / Com_commit ) = (54024 + 289659 + 6707 + 600) / 360402 = 0.974 -- Statements per Commit (assuming all InnoDB) -- Low: Might help to group queries together in transactions; High: long transactions strain various things.

( binlog_format ) = binlog_format = STATEMENT -- STATEMENT/ROW/MIXED. ROW is preferred; it may become the default.

( expire_logs_days ) = 0 -- How soon to automatically purge binlog (after this many days) -- Too large (or zero) = consumes disk space; too small = need to respond quickly to network/machine crash. (Not relevant if log_bin = OFF)

( slow_query_log ) = slow_query_log = OFF -- Whether to log slow queries. (5.1.12)

( long_query_time ) = 5 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2

( Connections ) = 922,332 / 107581 = 8.6 /sec -- Connections -- Increase wait_timeout; use pooling?

( thread_cache_size ) = 256 -- How many extra processes to keep around (Not relevant when using thread pooling) (Autosized as of 5.6.8; based on max_connections) 0 is inefficient for non-Windows; 10 is probably fine. More than 100 may lead to OOM.

Abnormally small:

Handler_read_next / Handler_read_key = 0.557
Handler_read_rnd_deleted = 0
Handler_tmp_write = 6 /sec
Innodb_buffer_pool_pages_made_young = 0.57 /HR
Innodb_secondary_index_triggered_cluster_reads = 10.2MB
Rows_tmp_read = 0.8M
interactive_timeout = 300
max_heap_table_size = 1MB
max_tmp_tables * tmp_table_size / _ram = 0.31%
min(max_heap_table_size, tmp_table_size) = 1MB

Abnormally large:

Acl_column_grants = 27
Acl_function_grants = 2
Acl_procedure_grants = 5
Acl_table_grants = 37
Com_admin_commands = 20 /sec
Com_drop_db = 0.1 /HR
Com_show_binlogs = 12 /HR
Com_show_slave_hosts = 0.033 /HR
Handler_discover = 1.7 /HR
Innodb_read_views_memory = 23,864
Max_used_connections = 691
Slave_connections = 2
Slaves_connected = 0.033 /HR
Threads_cached = 150
back_log = 550
host_cache_size = 728
innodb_defragment_fill_factor = 0.9
innodb_lru_scan_depth / innodb_io_capacity = 5.12
max_relay_log_size = 1024MB

Abnormal strings:

Slave_heartbeat_period = 1800
innodb_fast_shutdown = 1
myisam_stats_method = NULLS_UNEQUAL
opt_s__engine_condition_pushdown = off