Since the additional trace flags allowed you to narrow the delay down to the stage where the process tries to write to msdb, I'm willing to bet one or both of these queries returns high numbers:
SELECT MAX(row_count), SUM(in_row_reserved_page_count)
FROM msdb.sys.dm_db_partition_stats
WHERE [object_id] = OBJECT_ID('msdb.dbo.backupset');
SELECT database_name, [type], c = COUNT(*)
FROM msdb.dbo.backupset
GROUP BY database_name, [type]
ORDER BY c DESC;
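To see how much space all of the history tables are taking, not just backupset, you can aggregate the same DMV across objects. This is a sketch; the LIKE filters are just an assumption that every relevant table name starts with backup or restore:

```sql
-- Overall size of msdb's backup/restore history tables
SELECT t.name,
       SUM(ps.row_count)                      AS [rows],
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM msdb.sys.dm_db_partition_stats AS ps
INNER JOIN msdb.sys.tables AS t
        ON t.[object_id] = ps.[object_id]
WHERE t.name LIKE N'backup%' OR t.name LIKE N'restore%'
GROUP BY t.name
ORDER BY reserved_mb DESC;
```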
You can consider whittling away at some of the data in that table. Of course, only you know whether you have reasons to keep backup history that should otherwise be cleaned up (for example, the first rule I use to identify junk history is "databases that no longer exist" - but what if you renamed a database recently?). So please think about each of these rules, and only delete history where the rule makes sense for your scenario. For example:
CREATE TABLE #bs(id INT, ms INT);
-- Databases that no longer exist on this server:
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM msdb.dbo.backupset
WHERE database_name NOT IN (SELECT name FROM sys.databases);
-- All copy_only backups (because who cares?):
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM msdb.dbo.backupset
WHERE is_copy_only = 1;
-- All diffs except the most recent:
;WITH x AS
(
SELECT backup_set_id, media_set_id,
rn = ROW_NUMBER() OVER (PARTITION BY database_name
ORDER BY backup_set_id DESC)
FROM msdb.dbo.backupset
WHERE [type] = 'I'
AND backup_finish_date IS NOT NULL
)
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM x WHERE rn > 1;
...
-- All log backups prior to the most recent (non-copy-only) full:
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM msdb.dbo.backupset AS l
WHERE [type] = 'L' AND EXISTS
(
SELECT 1
FROM msdb.dbo.backupset AS bs2
WHERE bs2.database_name = l.database_name
AND bs2.[type] = 'D'
AND bs2.is_copy_only = 0
AND bs2.backup_finish_date > l.backup_finish_date
);
-- All fulls except the most recent successful non-copy-only full backup:
;WITH x AS
(
SELECT backup_set_id, media_set_id,
rn = ROW_NUMBER() OVER (PARTITION BY database_name
ORDER BY backup_set_id DESC)
FROM msdb.dbo.backupset
WHERE [type] = 'D'
AND is_copy_only = 0
AND backup_finish_date IS NOT NULL
)
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM x WHERE rn > 1;
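Before running any of the DELETEs, it is worth reviewing what the rules actually collected (this assumes the #bs table and the rule inserts above have already been run):

```sql
-- Summarize what would be deleted, per database and backup type
SELECT bs.database_name, bs.[type], c = COUNT(*)
FROM msdb.dbo.backupset AS bs
WHERE bs.backup_set_id IN (SELECT id FROM #bs)
GROUP BY bs.database_name, bs.[type]
ORDER BY c DESC;
```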
You also need to delete the associated file and filegroup information:
DELETE msdb.dbo.backupfile
WHERE backup_set_id IN (SELECT id FROM #bs);
DELETE msdb.dbo.backupfilegroup
WHERE backup_set_id IN (SELECT id FROM #bs);
And the associated restore history:
DELETE msdb.dbo.restorefilegroup WHERE restore_history_id IN
(SELECT restore_history_id FROM msdb.dbo.restorehistory
WHERE backup_set_id IN (SELECT id FROM #bs));
DELETE msdb.dbo.restorefile WHERE restore_history_id IN
(SELECT restore_history_id FROM msdb.dbo.restorehistory
WHERE backup_set_id IN (SELECT id FROM #bs));
DELETE msdb.dbo.restorehistory
WHERE backup_set_id IN (SELECT id FROM #bs);
Then the main rows can be deleted without violating any constraints:
DELETE msdb.dbo.backupset
WHERE backup_set_id IN (SELECT id FROM #bs);
Finally, with the backupset rows gone, you can remove the media family and media set rows that are no longer referenced. This has to happen last - while the backupset rows still exist, the NOT EXISTS checks below match nothing, because backupset has a foreign key on media_set_id:
DELETE mf
FROM msdb.dbo.backupmediafamily AS mf
WHERE mf.media_set_id IN (SELECT ms FROM #bs)
AND NOT EXISTS (SELECT 1 FROM msdb.dbo.backupset
WHERE media_set_id = mf.media_set_id);
DELETE msets
FROM msdb.dbo.backupmediaset AS msets
WHERE msets.media_set_id IN (SELECT ms FROM #bs)
AND NOT EXISTS (SELECT 1 FROM msdb.dbo.backupset
WHERE media_set_id = msets.media_set_id);
DROP TABLE #bs;
This whole process will likely not be fast - I highly recommend doing it during a maintenance window or slow period, and you should certainly suspend all backup jobs while doing this.
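If the history runs to millions of rows, each DELETE can also be batched to keep individual transactions (and pressure on msdb's log) small. A sketch of the pattern, using backupfile as the example; the TOP (10000) batch size is an arbitrary assumption to tune:

```sql
-- Batched delete: repeat until no rows remain, keeping each transaction short
DECLARE @rc INT = 1;
WHILE @rc > 0
BEGIN
    DELETE TOP (10000) bf
    FROM msdb.dbo.backupfile AS bf
    WHERE bf.backup_set_id IN (SELECT id FROM #bs);
    SET @rc = @@ROWCOUNT;
END;
```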
On Standard Edition there's no real clean way to do this as an online operation, but after you've deleted all those rows, you'll likely benefit most from a full rebuild:
ALTER INDEX ALL ON msdb.dbo.backupset REBUILD;
The above was based partially on the logic in the system procedure dbo.sp_delete_backuphistory, which leaves out some of the above criteria, has a reputation for locking up the entire set of backup- and restore-related msdb tables, and only allows a single filter on date (which may not be very helpful, depending on a variety of factors - including the percentage of bloat attributed to log backups taken before their most recent fulls, never mind that different databases may be on completely different schedules). You can see where most of my logic came from, though, by looking at the definition for yourself:
EXEC msdb.sys.sp_helptext @objname = N'dbo.sp_delete_backuphistory';
As already stated, the filesystem is the problem. However, there may be some tuning that would help.
If innodb_file_per_table has been ON, then there are 2*12 files in each database directory. Turning it OFF would lead to fewer files (but not fewer directories), which might help some.
SHOW VARIABLES LIKE 'table%';
SHOW VARIABLES LIKE 'innodb%files';
SHOW VARIABLES LIKE 'open%';
SHOW GLOBAL STATUS LIKE 'Opened%';
SHOW GLOBAL STATUS LIKE 'Uptime%';
SHOW GLOBAL STATUS LIKE 'Table%';
There is some info that can be gathered from those outputs. Or, give me (1) the amount of RAM, (2) SHOW VARIABLES;, and (3) SHOW GLOBAL STATUS;, and I will run a number of checks to see whether table_open_cache is too small, among other things.
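You can also compute the most telling of those checks yourself. A sketch, assuming a version where information_schema still exposes GLOBAL_STATUS (MySQL 5.6 / MariaDB of that era):

```sql
-- Table opens per second since startup; a sustained high rate suggests
-- table_open_cache is too small
SELECT o.VARIABLE_VALUE / u.VARIABLE_VALUE AS opened_tables_per_sec
FROM information_schema.GLOBAL_STATUS AS o
JOIN information_schema.GLOBAL_STATUS AS u
  ON u.VARIABLE_NAME = 'UPTIME'
WHERE o.VARIABLE_NAME = 'OPENED_TABLES';
```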
Best Answer
Observations:
The More Important Issues:
- Unless you have a lot of other apps on the server, innodb_buffer_pool_size should be increased, perhaps to 5G.
- There is pressure on the table_open_cache, which is currently set to 2000; double that.
- Increase table_definition_cache from 400 to 1000.
- innodb_log_file_size is rather small. However, making it larger is a pain on your old version, so I don't recommend it at the moment.
- Are you using ARIA?
- A lot of complex queries. See this for investigating them: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
- You have a bunch of stored procedures? And some of them do PREPARE and EXECUTE? Some of them seem to fail to CLOSE (DEALLOCATE) them. This may lead to excessive memory usage. (I don't know if this could be causing the crash, but you should clean them up.)
- Consider subquery_cache=off (see optimizer_switch).
Details and other observations:
( Opened_tables ) = 2,098,690 / 430933 = 4.9 /sec -- Frequency of opening Tables -- increase table_open_cache (now 2000)
( Opened_table_definitions ) = 1,397,838 / 430933 = 3.2 /sec -- Frequency of opening .frm files -- Increase table_definition_cache (now 400) and/or table_open_cache (now 2000).
( innodb_lru_scan_depth ) = 1,024 -- "InnoDB: page_cleaner: 1000ms intended loop took ..." may be fixed by lowering lru_scan_depth
( innodb_io_capacity_max / innodb_io_capacity ) = 2,000 / 200 = 10 -- Capacity: max/plain -- Recommend 2. Max should be about equal to the IOPs your I/O subsystem can handle. (If the drive type is unknown, 2000/200 may be a reasonable pair.)
( Innodb_pages_written / Innodb_buffer_pool_write_requests ) = 1,753,906 / 7374538 = 23.8% -- Write requests that had to hit disk -- Check innodb_buffer_pool_size (now 1073741824)
( Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group / innodb_log_file_size ) = 737,214,976 / (430933 / 3600) / 2 / 48M = 0.0612 -- Ratio -- (see minutes)
( Uptime / 60 * innodb_log_file_size / Innodb_os_log_written ) = 430,933 / 60 * 48M / 737214976 = 490 -- Minutes between InnoDB log rotations. Beginning with 5.6.8, this can be changed dynamically; be sure to also change my.cnf. -- (The recommendation of 60 minutes between rotations is somewhat arbitrary.) Adjust innodb_log_file_size (now 50331648). (Cannot change in AWS.)
( innodb_flush_method ) = innodb_flush_method = -- How InnoDB should ask the OS to write blocks. Suggest O_DIRECT or O_ALL_DIRECT (Percona) to avoid double buffering. (At least for Unix.) See chrischandler for caveat about O_ALL_DIRECT
( default_tmp_storage_engine ) = default_tmp_storage_engine =
( innodb_flush_neighbors ) = 1 -- A minor optimization when writing blocks to disk. -- Use 0 for SSD drives; 1 for HDD.
( innodb_io_capacity ) = 200 -- I/O ops per second capable on disk. 100 for slow drives; 200 for spinning drives; 1000-2000 for SSDs; multiply by RAID factor.
( Innodb_deadlocks ) = 2 / 430933 = 0.017 /HR -- Deadlocks -- SHOW ENGINE INNODB STATUS; to see the latest pair of queries that deadlocked.
( sync_binlog ) = 0 -- Use 1 for added security, at some cost of I/O. =1 may lead to lots of "query end"; =0 may lead to "binlog at impossible position" and lose transactions in a crash, but is faster.
( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.
( innodb_buffer_pool_populate ) = OFF = 0 -- NUMA control
( log_warnings ) = log_warnings = 3
( (Com_show_create_table + Com_show_fields) / Questions ) = (622768 + 622768) / 21607187 = 5.8% -- Naughty framework -- spending a lot of effort rediscovering the schema. -- Complain to the 3rd party vendor.
( local_infile ) = local_infile = ON -- local_infile (now ON) = ON is a potential security issue
( Qcache_lowmem_prunes / Qcache_inserts ) = 2,656,624 / 4159715 = 63.9% -- Removal Ratio (frequency of needing to prune due to not enough memory)
( (query_cache_size - Qcache_free_memory) / Qcache_queries_in_cache / query_alloc_block_size ) = (16M - 4531192) / 4235 / 16384 = 0.176 -- query_alloc_block_size vs formula -- Adjust query_alloc_block_size (now 16384)
( Created_tmp_disk_tables ) = 1,309,758 / 430933 = 3 /sec -- Frequency of creating disk "temp" tables as part of complex SELECTs -- increase tmp_table_size (now 16777216) and max_heap_table_size (now 16777216). Check the rules for temp tables on when MEMORY is used instead of MyISAM. Perhaps minor schema or query changes can avoid MyISAM. Better indexes and reformulation of queries are more likely to help.
( Created_tmp_disk_tables / Questions ) = 1,309,758 / 21607187 = 6.1% -- Pct of queries that needed on-disk tmp table. -- Better indexes / No blobs / etc.
( Created_tmp_disk_tables / Created_tmp_tables ) = 1,309,758 / 2706252 = 48.4% -- Percent of temp tables that spilled to disk -- Maybe increase tmp_table_size (now 16777216) and max_heap_table_size (now 16777216); improve indexes; avoid blobs, etc.
( (Com_insert + Com_update + Com_delete + Com_replace) / Com_commit ) = (27132 + 306536 + 20780 + 1296) / 372094 = 0.956 -- Statements per Commit (assuming all InnoDB) -- Low: Might help to group queries together in transactions; High: long transactions strain various things.
( Select_scan ) = 3,511,698 / 430933 = 8.1 /sec -- full table scans -- Add indexes / optimize queries (unless they are tiny tables)
( Select_scan / Com_select ) = 3,511,698 / 14426841 = 24.3% -- % of selects doing full table scan. (May be fooled by Stored Routines.) -- Add indexes / optimize queries
( Com_stmt_prepare - Com_stmt_close ) = 12,457,121 - 12445287 = 11,834 -- How many prepared statements have not been closed. -- CLOSE prepared statements
( binlog_format ) = binlog_format = STATEMENT -- STATEMENT/ROW/MIXED. -- ROW is preferred by 5.7 (10.3)
( slow_query_log ) = slow_query_log = OFF -- Whether to log slow queries. (5.1.12)
( long_query_time ) = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2
( Subquery_cache_hit / ( Subquery_cache_hit + Subquery_cache_miss ) ) = 356,216 / ( 356216 + 10183006 ) = 3.4% -- Subquery cache hit rate
( log_slow_slave_statements ) = log_slow_slave_statements = OFF -- (5.6.11, 5.7.1) By default, replicated statements won't show up in the slowlog; this causes them to show. -- It can be helpful in the slowlog to see writes that could be interfering with Slave reads.
Abnormally small:
Abnormally large:
Abnormal strings:
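The Com_stmt_prepare vs Com_stmt_close gap above is what the stored-procedure cleanup is about: every PREPARE should be paired with a DEALLOCATE. A minimal sketch of the leak-free pattern (the table name is a placeholder):

```sql
-- Dynamic SQL inside a stored procedure, done without leaking the statement
SET @sql = 'SELECT COUNT(*) FROM some_table';  -- some_table is a placeholder
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;  -- without this, Com_stmt_prepare - Com_stmt_close keeps growing
```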