Since the additional trace flags allowed you to narrow the delay down to the stage where the process tries to write to msdb, I'm willing to bet one or both of these queries returns high numbers:
SELECT MAX(row_count), SUM(in_row_reserved_page_count)
FROM msdb.sys.dm_db_partition_stats
WHERE [object_id] = OBJECT_ID('msdb.dbo.backupset');
SELECT database_name, [type], c = COUNT(*)
FROM msdb.dbo.backupset
GROUP BY database_name, [type]
ORDER BY c DESC;
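To see how much space all of the history tables are taking, not just backupset, you can aggregate the same DMV across objects. This is a sketch; the LIKE filters are just an assumption that every relevant table name starts with backup or restore:

```sql
-- Overall size of msdb's backup/restore history tables
SELECT t.name,
       SUM(ps.row_count)                      AS [rows],
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM msdb.sys.dm_db_partition_stats AS ps
INNER JOIN msdb.sys.tables AS t
        ON t.[object_id] = ps.[object_id]
WHERE t.name LIKE N'backup%' OR t.name LIKE N'restore%'
GROUP BY t.name
ORDER BY reserved_mb DESC;
```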
You can consider whittling away at some of the data in that table. Of course, only you know whether you have reasons to keep backup history that should otherwise be cleaned up (for example, the first rule I use to identify junk history is "databases that no longer exist" - but what if you renamed a database recently?). So please think about each of these rules, and only delete history where the rule makes sense for your scenario. For example:
CREATE TABLE #bs(id INT, ms INT);
-- Databases that no longer exist on this server:
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM msdb.dbo.backupset
WHERE database_name NOT IN (SELECT name FROM sys.databases);
-- All copy_only backups (because who cares?):
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM msdb.dbo.backupset
WHERE is_copy_only = 1;
-- All diffs except the most recent:
;WITH x AS
(
SELECT backup_set_id, media_set_id,
rn = ROW_NUMBER() OVER (PARTITION BY database_name
ORDER BY backup_set_id DESC)
FROM msdb.dbo.backupset
WHERE [type] = 'I'
AND backup_finish_date IS NOT NULL
)
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM x WHERE rn > 1;
...
-- All log backups prior to the most recent (non-copy-only) full:
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM msdb.dbo.backupset AS l
WHERE [type] = 'L' AND EXISTS
(
SELECT 1
FROM msdb.dbo.backupset AS bs2
WHERE bs2.database_name = l.database_name
AND bs2.[type] = 'D'
AND bs2.is_copy_only = 0
AND bs2.backup_finish_date > l.backup_finish_date
);
-- All fulls except the most recent successful non-copy-only full backup:
;WITH x AS
(
SELECT backup_set_id, media_set_id,
rn = ROW_NUMBER() OVER (PARTITION BY database_name
ORDER BY backup_set_id DESC)
FROM msdb.dbo.backupset
WHERE [type] = 'D'
AND is_copy_only = 0
AND backup_finish_date IS NOT NULL
)
INSERT #bs(id,ms) SELECT backup_set_id, media_set_id
FROM x WHERE rn > 1;
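Before running any of the DELETEs, it is worth reviewing what the rules actually collected (this assumes the #bs table and the rule inserts above have already been run):

```sql
-- Summarize what would be deleted, per database and backup type
SELECT bs.database_name, bs.[type], c = COUNT(*)
FROM msdb.dbo.backupset AS bs
WHERE bs.backup_set_id IN (SELECT id FROM #bs)
GROUP BY bs.database_name, bs.[type]
ORDER BY c DESC;
```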
You also need to delete the associated file and filegroup information:
DELETE msdb.dbo.backupfile
WHERE backup_set_id IN (SELECT id FROM #bs);
DELETE msdb.dbo.backupfilegroup
WHERE backup_set_id IN (SELECT id FROM #bs);
And the associated restore history:
DELETE msdb.dbo.restorefilegroup WHERE restore_history_id IN
(SELECT restore_history_id FROM msdb.dbo.restorehistory
WHERE backup_set_id IN (SELECT id FROM #bs));
DELETE msdb.dbo.restorefile WHERE restore_history_id IN
(SELECT restore_history_id FROM msdb.dbo.restorehistory
WHERE backup_set_id IN (SELECT id FROM #bs));
DELETE msdb.dbo.restorehistory
WHERE backup_set_id IN (SELECT id FROM #bs);
Then the main rows can be deleted without violating any constraints:
DELETE msdb.dbo.backupset
WHERE backup_set_id IN (SELECT id FROM #bs);
Finally, with the backupset rows gone, you can remove the media family and media set rows that are no longer referenced. This has to happen last - while the backupset rows still exist, the NOT EXISTS checks below match nothing, because backupset has a foreign key on media_set_id:
DELETE mf
FROM msdb.dbo.backupmediafamily AS mf
WHERE mf.media_set_id IN (SELECT ms FROM #bs)
AND NOT EXISTS (SELECT 1 FROM msdb.dbo.backupset
WHERE media_set_id = mf.media_set_id);
DELETE msets
FROM msdb.dbo.backupmediaset AS msets
WHERE msets.media_set_id IN (SELECT ms FROM #bs)
AND NOT EXISTS (SELECT 1 FROM msdb.dbo.backupset
WHERE media_set_id = msets.media_set_id);
DROP TABLE #bs;
This whole process will likely not be fast - I highly recommend doing it during a maintenance window or slow period, and you should certainly suspend all backup jobs while doing this.
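If the history runs to millions of rows, each DELETE can also be batched to keep individual transactions (and pressure on msdb's log) small. A sketch of the pattern, using backupfile as the example; the TOP (10000) batch size is an arbitrary assumption to tune:

```sql
-- Batched delete: repeat until no rows remain, keeping each transaction short
DECLARE @rc INT = 1;
WHILE @rc > 0
BEGIN
    DELETE TOP (10000) bf
    FROM msdb.dbo.backupfile AS bf
    WHERE bf.backup_set_id IN (SELECT id FROM #bs);
    SET @rc = @@ROWCOUNT;
END;
```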
On Standard Edition there's no real clean way to do this as an online operation, but after you've deleted all those rows, you'll likely benefit most from a full rebuild:
ALTER INDEX ALL ON msdb.dbo.backupset REBUILD;
The above was based partially on the logic in the system procedure dbo.sp_delete_backuphistory, which leaves out some of the above criteria, has a reputation for locking up the entire set of backup- and restore-related msdb tables, and only allows a single filter on date (which may not be very helpful, depending on a variety of factors - including the percentage of bloat attributed to log backups taken before their most recent fulls, never mind that different databases may be on completely different schedules). You can see where most of my logic came from, though, by looking at the definition for yourself:
EXEC msdb.sys.sp_helptext @objname = N'dbo.sp_delete_backuphistory';
As already stated, the filesystem is the problem. However, there may be some tuning that would help.
If innodb_file_per_table has been ON, then there are 2*12 files in each database directory. Turning it OFF would lead to fewer files (but not fewer directories), which might help some.
SHOW VARIABLES LIKE 'table%';
SHOW VARIABLES LIKE 'innodb%files';
SHOW VARIABLES LIKE 'open%';
SHOW GLOBAL STATUS LIKE 'Opened%';
SHOW GLOBAL STATUS LIKE 'Uptime%';
SHOW GLOBAL STATUS LIKE 'Table%';
There is some info that can be gathered from those outputs. Or, give me (1) the amount of RAM, (2) SHOW VARIABLES;, and (3) SHOW GLOBAL STATUS;, and I will run a number of checks to see whether table_open_cache is too small, among other things.
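You can also compute the most telling of those checks yourself. A sketch, assuming a version where information_schema still exposes GLOBAL_STATUS (MySQL 5.6 / MariaDB of that era):

```sql
-- Table opens per second since startup; a sustained high rate suggests
-- table_open_cache is too small
SELECT o.VARIABLE_VALUE / u.VARIABLE_VALUE AS opened_tables_per_sec
FROM information_schema.GLOBAL_STATUS AS o
JOIN information_schema.GLOBAL_STATUS AS u
  ON u.VARIABLE_NAME = 'UPTIME'
WHERE o.VARIABLE_NAME = 'OPENED_TABLES';
```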
Best Answer
Observations:
The More Important Issues:
- Unless you have a lot of other apps on the server, innodb_buffer_pool_size should be increased, perhaps to 5G.
- There is pressure on the table_open_cache, which is currently set to 2000; double that.
- Increase table_definition_cache from 400 to 1000.
- innodb_log_file_size is rather small. However, making it larger is a pain on your old version, so I don't recommend it at the moment.
- Are you using ARIA?
- A lot of complex queries. See this for investigating them: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
- You have a bunch of stored procedures? And some of them do PREPARE and EXECUTE? Some of them seem to fail to CLOSE (DEALLOCATE) them. This may lead to excessive memory usage. (I don't know if this could be causing the crash, but you should clean them up.)
- Consider subquery_cache=off (see optimizer_switch).
Details and other observations:
( Opened_tables ) = 2,098,690 / 430933 = 4.9 /sec -- Frequency of opening Tables -- increase table_open_cache (now 2000)
( Opened_table_definitions ) = 1,397,838 / 430933 = 3.2 /sec -- Frequency of opening .frm files -- Increase table_definition_cache (now 400) and/or table_open_cache (now 2000).
( innodb_lru_scan_depth ) = 1,024 -- "InnoDB: page_cleaner: 1000ms intended loop took ..." may be fixed by lowering lru_scan_depth
( innodb_io_capacity_max / innodb_io_capacity ) = 2,000 / 200 = 10 -- Capacity: max/plain -- Recommend 2. Max should be about equal to the IOPs your I/O subsystem can handle. (If the drive type is unknown, 2000/200 may be a reasonable pair.)
( Innodb_pages_written / Innodb_buffer_pool_write_requests ) = 1,753,906 / 7374538 = 23.8% -- Write requests that had to hit disk -- Check innodb_buffer_pool_size (now 1073741824)
( Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group / innodb_log_file_size ) = 737,214,976 / (430933 / 3600) / 2 / 48M = 0.0612 -- Ratio -- (see minutes)
( Uptime / 60 * innodb_log_file_size / Innodb_os_log_written ) = 430,933 / 60 * 48M / 737214976 = 490 -- Minutes between InnoDB log rotations. Beginning with 5.6.8, this can be changed dynamically; be sure to also change my.cnf. -- (The recommendation of 60 minutes between rotations is somewhat arbitrary.) Adjust innodb_log_file_size (now 50331648). (Cannot change in AWS.)
( innodb_flush_method ) = innodb_flush_method = -- How InnoDB should ask the OS to write blocks. Suggest O_DIRECT or O_ALL_DIRECT (Percona) to avoid double buffering. (At least for Unix.) See chrischandler for caveat about O_ALL_DIRECT
( default_tmp_storage_engine ) = default_tmp_storage_engine =
( innodb_flush_neighbors ) = 1 -- A minor optimization when writing blocks to disk. -- Use 0 for SSD drives; 1 for HDD.
( innodb_io_capacity ) = 200 -- I/O ops per second capable on disk. 100 for slow drives; 200 for spinning drives; 1000-2000 for SSDs; multiply by RAID factor.
( Innodb_deadlocks ) = 2 / 430933 = 0.017 /HR -- Deadlocks -- SHOW ENGINE INNODB STATUS; to see the latest pair of queries that deadlocked.
( sync_binlog ) = 0 -- Use 1 for added security, at some cost of I/O. =1 may lead to lots of "query end"; =0 may lead to "binlog at impossible position" and lose transactions in a crash, but is faster.
( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.
( innodb_buffer_pool_populate ) = OFF = 0 -- NUMA control
( log_warnings ) = log_warnings = 3
( (Com_show_create_table + Com_show_fields) / Questions ) = (622768 + 622768) / 21607187 = 5.8% -- Naughty framework -- spending a lot of effort rediscovering the schema. -- Complain to the 3rd party vendor.
( local_infile ) = local_infile = ON -- local_infile (now ON) = ON is a potential security issue
( Qcache_lowmem_prunes / Qcache_inserts ) = 2,656,624 / 4159715 = 63.9% -- Removal Ratio (frequency of needing to prune due to not enough memory)
( (query_cache_size - Qcache_free_memory) / Qcache_queries_in_cache / query_alloc_block_size ) = (16M - 4531192) / 4235 / 16384 = 0.176 -- query_alloc_block_size vs formula -- Adjust query_alloc_block_size (now 16384)
( Created_tmp_disk_tables ) = 1,309,758 / 430933 = 3 /sec -- Frequency of creating disk "temp" tables as part of complex SELECTs -- increase tmp_table_size (now 16777216) and max_heap_table_size (now 16777216). Check the rules for temp tables on when MEMORY is used instead of MyISAM. Perhaps minor schema or query changes can avoid MyISAM. Better indexes and reformulation of queries are more likely to help.
( Created_tmp_disk_tables / Questions ) = 1,309,758 / 21607187 = 6.1% -- Pct of queries that needed on-disk tmp table. -- Better indexes / No blobs / etc.
( Created_tmp_disk_tables / Created_tmp_tables ) = 1,309,758 / 2706252 = 48.4% -- Percent of temp tables that spilled to disk -- Maybe increase tmp_table_size (now 16777216) and max_heap_table_size (now 16777216); improve indexes; avoid blobs, etc.
( (Com_insert + Com_update + Com_delete + Com_replace) / Com_commit ) = (27132 + 306536 + 20780 + 1296) / 372094 = 0.956 -- Statements per Commit (assuming all InnoDB) -- Low: Might help to group queries together in transactions; High: long transactions strain various things.
( Select_scan ) = 3,511,698 / 430933 = 8.1 /sec -- full table scans -- Add indexes / optimize queries (unless they are tiny tables)
( Select_scan / Com_select ) = 3,511,698 / 14426841 = 24.3% -- % of selects doing full table scan. (May be fooled by Stored Routines.) -- Add indexes / optimize queries
( Com_stmt_prepare - Com_stmt_close ) = 12,457,121 - 12445287 = 11,834 -- How many prepared statements have not been closed. -- CLOSE prepared statements
( binlog_format ) = binlog_format = STATEMENT -- STATEMENT/ROW/MIXED. -- ROW is preferred by 5.7 (10.3)
( slow_query_log ) = slow_query_log = OFF -- Whether to log slow queries. (5.1.12)
( long_query_time ) = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2
( Subquery_cache_hit / ( Subquery_cache_hit + Subquery_cache_miss ) ) = 356,216 / ( 356216 + 10183006 ) = 3.4% -- Subquery cache hit rate
( log_slow_slave_statements ) = log_slow_slave_statements = OFF -- (5.6.11, 5.7.1) By default, replicated statements won't show up in the slowlog; this causes them to show. -- It can be helpful in the slowlog to see writes that could be interfering with Slave reads.
Abnormally small:
Abnormally large:
Abnormal strings:
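The Com_stmt_prepare vs Com_stmt_close gap above is what the stored-procedure cleanup is about: every PREPARE should be paired with a DEALLOCATE. A minimal sketch of the leak-free pattern (the table name is a placeholder):

```sql
-- Dynamic SQL inside a stored procedure, done without leaking the statement
SET @sql = 'SELECT COUNT(*) FROM some_table';  -- some_table is a placeholder
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;  -- without this, Com_stmt_prepare - Com_stmt_close keeps growing
```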