MariaDB Performance – Optimizing for 5-10K Small Databases

mariadbperformance

Running a SaaS based on MariaDB LTS (1:10.1.44-0ubuntu0.18.04.1) with nearly 5000 databases (1 per tenant). Average per tenant: 8MB of data (on disk) with ~50 tables. CPU/Mem load is negligible. It's been running fine for a couple years, with some 5-20 tenants added per day.

However. When the nr of dbs grows to about 5150, the MariaDB server process crashes. So far I haven't been able to distill a useful log message, unfortunately. Furthermore, it takes about 10 minutes for MariaDB to start up, which is quite long when I want to experiment with new settings.

What settings would you recommend to reduce the risk of crashing, and speed up starting time?

Some more characteristics:

  • SHOW GLOBAL STATUS
  • SHOW VARIABLES
  • Server has 2CPU and 8GB ram, of which 25% is available during peak times.
  • Load per CPU core never exceeds 0.20 during peak times
  • Connections are short lived, I expect that there are never more than 20 concurrent db connections.
  • All tables are InnoDB
  • Schema may differ per database (but are probably identical for 98% of tenants)
  • I use stock configuration, except for:
innodb_file_per_table
open_files_limit        = 65536
log_warnings            = 3
innodb_buffer_pool_size = 1G

FWIW, I'm willing to upgrade to a newer version of MariaDB, but given the risk of new uncertainties, I'd rather make an informed decision than take a leap of faith.

Thanks for any advice!

Best Answer

Observations:

  • Version: 10.1.44-MariaDB-0ubuntu0.18.04.1
  • 8 GB of RAM
  • Uptime = 4d 23:42:13
  • You are not running on Windows.
  • Running 64-bit version
  • You appear to be running entirely (or mostly) InnoDB.

The More Important Issues:

Unless you have a lot of other apps on the server, innodb_buffer_pool_size should be increased, perhaps to 5G.

There is pressure on the "table_open_cache", which is currently set to 2000; double that.

Increase table_definition_cache from 400 to 1000

innodb_log_file_size is rather small. However, making it larger is a pain for your old version, so I don't recommend at the moment.

Are you using ARIA?

A lot of complex queries. See this for investigating them: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog

You have a bunch of stored procedures? And some of them do PREPARE and EXECUTE? Some of them seem to fail to CLOSE (DEALLOCATE) them. This may lead to excessive memory usage. (I don't know if this could be causing the crash, but you should clean them up.) And consider subquery_cache=off (see optimizer_switch)

Details and other observations:

( Opened_tables ) = 2,098,690 / 430933 = 4.9 /sec -- Frequency of opening Tables -- increase table_open_cache (now 2000)

( Opened_table_definitions ) = 1,397,838 / 430933 = 3.2 /sec -- Frequency of opening .frm files -- Increase table_definition_cache (now 400) and/or table_open_cache (now 2000).

( innodb_lru_scan_depth ) = 1,024 -- "InnoDB: page_cleaner: 1000ms intended loop took ..." may be fixed by lowering lru_scan_depth

( innodb_io_capacity_max / innodb_io_capacity ) = 2,000 / 200 = 10 -- Capacity: max/plain -- Recommend 2. Max should be about equal to the IOPs your I/O subsystem can handle. (If the drive type is unknown 2000/200 may be a reasonable pair.)

( Innodb_pages_written / Innodb_buffer_pool_write_requests ) = 1,753,906 / 7374538 = 23.8% -- Write requests that had to hit disk -- Check innodb_buffer_pool_size (now 1073741824)

( Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group / innodb_log_file_size ) = 737,214,976 / (430933 / 3600) / 2 / 48M = 0.0612 -- Ratio -- (see minutes)

( Uptime / 60 * innodb_log_file_size / Innodb_os_log_written ) = 430,933 / 60 * 48M / 737214976 = 490 -- Minutes between InnoDB log rotations Beginning with 5.6.8, this can be changed dynamically; be sure to also change my.cnf. -- (The recommendation of 60 minutes between rotations is somewhat arbitrary.) Adjust innodb_log_file_size (now 50331648). (Cannot change in AWS.)

( innodb_flush_method ) = innodb_flush_method = -- How InnoDB should ask the OS to write blocks. Suggest O_DIRECT or O_ALL_DIRECT (Percona) to avoid double buffering. (At least for Unix.) See chrischandler for caveat about O_ALL_DIRECT

( default_tmp_storage_engine ) = default_tmp_storage_engine =

( innodb_flush_neighbors ) = 1 -- A minor optimization when writing blocks to disk. -- Use 0 for SSD drives; 1 for HDD.

( innodb_io_capacity ) = 200 -- I/O ops per second capable on disk . 100 for slow drives; 200 for spinning drives; 1000-2000 for SSDs; multiply by RAID factor.

( Innodb_deadlocks ) = 2 / 430933 = 0.017 /HR -- Deadlocks -- SHOW ENGINE INNODB STATUS; to see the latest pair of queries that deadlocked.

( sync_binlog ) = 0 -- Use 1 for added security, at some cost of I/O =1 may lead to lots of "query end"; =0 may lead to "binlog at impossible position" and lose transactions in a crash, but is faster.

( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.

( innodb_buffer_pool_populate ) = OFF = 0 -- NUMA control

( log_warnings ) = log_warnings = 3

( (Com_show_create_table + Com_show_fields) / Questions ) = (622768 + 622768) / 21607187 = 5.8% -- Naughty framework -- spending a lot of effort rediscovering the schema. -- Complain to the 3rd party vendor.

( local_infile ) = local_infile = ON -- local_infile (now ON) = ON is a potential security issue

( Qcache_lowmem_prunes/Qcache_inserts ) = 2,656,624/4159715 = 63.9% -- Removal Ratio (frequency of needing to prune due to not enough memory)

( (query_cache_size - Qcache_free_memory) / Qcache_queries_in_cache / query_alloc_block_size ) = (16M - 4531192) / 4235 / 16384 = 0.176 -- query_alloc_block_size vs formula -- Adjust query_alloc_block_size (now 16384)

( Created_tmp_disk_tables ) = 1,309,758 / 430933 = 3 /sec -- Frequency of creating disk "temp" tables as part of complex SELECTs -- increase tmp_table_size (now 16777216) and max_heap_table_size (now 16777216). Check the rules for temp tables on when MEMORY is used instead of MyISAM. Perhaps minor schema or query changes can avoid MyISAM. Better indexes and reformulation of queries are more likely to help.

( Created_tmp_disk_tables / Questions ) = 1,309,758 / 21607187 = 6.1% -- Pct of queries that needed on-disk tmp table. -- Better indexes / No blobs / etc.

( Created_tmp_disk_tables / Created_tmp_tables ) = 1,309,758 / 2706252 = 48.4% -- Percent of temp tables that spilled to disk -- Maybe increase tmp_table_size (now 16777216) and max_heap_table_size (now 16777216); improve indexes; avoid blobs, etc.

( (Com_insert + Com_update + Com_delete + Com_replace) / Com_commit ) = (27132 + 306536 + 20780 + 1296) / 372094 = 0.956 -- Statements per Commit (assuming all InnoDB) -- Low: Might help to group queries together in transactions; High: long transactions strain various things.

( Select_scan ) = 3,511,698 / 430933 = 8.1 /sec -- full table scans -- Add indexes / optimize queries (unless they are tiny tables)

( Select_scan / Com_select ) = 3,511,698 / 14426841 = 24.3% -- % of selects doing full table scan. (May be fooled by Stored Routines.) -- Add indexes / optimize queries

( Com_stmt_prepare - Com_stmt_close ) = 12,457,121 - 12445287 = 11,834 -- How many prepared statements have not been closed. -- CLOSE prepared statements

( binlog_format ) = binlog_format = STATEMENT -- STATEMENT/ROW/MIXED. -- ROW is preferred by 5.7 (10.3)

( slow_query_log ) = slow_query_log = OFF -- Whether to log slow queries. (5.1.12)

( long_query_time ) = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2

( Subquery_cache_hit / ( Subquery_cache_hit + Subquery_cache_miss ) ) = 356,216 / ( 356216 + 10183006 ) = 3.4% -- Subquery cache hit rate

( log_slow_slave_statements ) = log_slow_slave_statements = OFF -- (5.6.11, 5.7.1) By default, replicated statements won't show up in the slowlog; this causes them to show. -- It can be helpful in the slowlog to see writes that could be interfering with Slave reads.

Abnormally small:

Com_show_status = 0.0084 /HR
Handler_write = 0.11 /sec

Abnormally large:

Acl_database_grants = 4,656
Acl_users = 4,658
Aria_pagecache_reads = 2.9 /sec
Com_create_db = 0.9 /HR
Com_drop_db = 0.14 /HR
Com_drop_user = 0.14 /HR
Com_grant = 0.9 /HR
Com_release_savepoint = 0.067 /HR
Com_rollback_to_savepoint = 1.9 /HR
Com_savepoint = 0.067 /HR
Com_show_create_table = 1.4 /sec
Com_show_fields = 1.4 /sec
Com_stmt_close = 29 /sec
Com_stmt_prepare = 29 /sec
Com_unlock_tables = 0.06 /sec
Feature_fulltext = 0.067 /sec
Feature_subquery = 34 /sec
Feature_timezone = 0.054 /sec
Handler_savepoint = 0.067 /HR
Handler_savepoint_rollback = 1.9 /HR
Innodb_pages0_read = 223,449
Tc_log_page_size = 4,096

Abnormal strings:

innodb_default_row_format = compact
innodb_fast_shutdown = 1