I'm systematically (1/2 times per day) experiencing a freeze in the DB server.
Suddenly, all the COMMITs get stuck in 'init' state; they pile up (up to 20/30), an almost no other queries are executed at all.
After ~30 seconds, the server resumes normal operations.
I've mined a lot of data (which follows), which shows that something is going on in InnoDB, but I struggle to find a specific cause.
In the processlist:
- during the problem I see only COMMITs
- before, there are no long-running operations
I've also gone through the binary log, but there is nothing suspicious:
- during the problem, there is nothing being writte
- before, only small queries
As a matter of fact, the binary log starting from shortly before the problem is rather small – being in ROW mode, if there would have been a huge update operation, I would expect it to be large.
I considered a problem with the disk, but the data doesn't fit the pattern of, for example, a sudden disk slowdown.
Which could be the cause? From this type of problem, I would expect a large write query, but I can't find any.
The system is a dedicated server, with 128GB of memory and Dual CPU Intel Xeon E5-2680 v2 @ 2.80GHz.
The storage is a mirror of Seagate 15k disks; the O/S is Ubuntu 14.04, and the files are stored in a separate ext4 partition.
MySQL is v5.6.24, with an allocated buffer pool of 24G.
The query cache is disabled.
Charts, global status and innodb variables follow, with some notes.
In all the charts, each X unit is a second. The problem approximate interval is highlighted by the vertical bars.
Note that I have all the processes, and global status variables, and innodb statuses, archived, for each second of the problem.
Innodb buffer pool pages
There are significant spikes in the dirty/free pages:
And also in the buffer pool size:
The pages flushed have a very interesting behavior; the don't spike a lot, but InnoDB starts a continuous flushed; after the criticality, it skyrockets:
Innodb_buffer_pool_wait_free
is always 0.
Log file
The is an increase in the data written to the log file, although not huge:
Disk operations
There are no significant change in pending operations (fsyncs/read/writes), but is a drop in the actual fsyncs:
Commands
Almost all the commands (Com_*) don't execute during the problem.
The following are the only ones executing, which clearly plunged:
Current global status (partial)
+-----------------------------------------------+----------------+
| Aborted_clients | 4767955 |
| Aborted_connects | 33 |
| Binlog_cache_disk_use | 47747 |
| Binlog_cache_use | 68002226 |
| Binlog_stmt_cache_disk_use | 0 |
| Binlog_stmt_cache_use | 682187 |
| Bytes_received | 810930341088 |
| Bytes_sent | 6128156018408 |
| Com_admin_commands | 66190088 |
| Com_alter_table | 5837 |
| Com_analyze | 9742 |
| Com_begin | 328167299 |
| Com_change_db | 216 |
| Com_commit | 327839739 |
| Com_create_db | 13 |
| Com_create_index | 1 |
| Com_create_table | 877609 |
| Com_delete | 13720394 |
| Com_delete_multi | 89347 |
| Com_drop_db | 14 |
| Com_drop_table | 740674 |
| Com_insert | 130474570 |
| Com_insert_select | 572148 |
| Com_kill | 98 |
| Com_load | 1 |
| Com_lock_tables | 3 |
| Com_release_savepoint | 28061 |
| Com_rename_table | 978 |
| Com_replace | 7995547 |
| Com_replace_select | 46493 |
| Com_rollback | 284617 |
| Com_rollback_to_savepoint | 3 |
| Com_savepoint | 28062 |
| Com_select | 1664064423 |
| Com_set_option | 152369114 |
| Com_show_create_table | 64205947 |
| Com_show_databases | 11 |
| Com_show_engine_status | 1375041 |
| Com_show_fields | 97123409 |
| Com_show_keys | 1 |
| Com_show_master_status | 66 |
| Com_show_processlist | 2624788 |
| Com_show_slave_status | 1250533 |
| Com_show_status | 1591300 |
| Com_show_table_status | 198 |
| Com_show_tables | 90403059 |
| Com_show_triggers | 197 |
| Com_show_variables | 1580763 |
| Com_show_warnings | 3 |
| Com_unlock_tables | 4 |
| Com_update | 39673285 |
| Com_update_multi | 1259352 |
| Compression | OFF |
| Connections | 5426115 |
| Created_tmp_disk_tables | 108340111 |
| Created_tmp_files | 633100 |
| Created_tmp_tables | 350818717 |
| Delayed_errors | 0 |
| Delayed_insert_threads | 0 |
| Delayed_writes | 0 |
| Flush_commands | 1 |
| Handler_commit | 2607344684 |
| Handler_delete | 59374751 |
| Handler_discover | 0 |
| Handler_external_lock | 6579792721 |
| Handler_mrr_init | 0 |
| Handler_prepare | 379741472 |
| Handler_read_first | 122813781 |
| Handler_read_key | 300828885004 |
| Handler_read_last | 68546 |
| Handler_read_next | 1973974492391 |
| Handler_read_prev | 535277275 |
| Handler_read_rnd | 25166576467 |
| Handler_read_rnd_next | 403233196007 |
| Handler_rollback | 7441860 |
| Handler_savepoint | 56123 |
| Handler_savepoint_rollback | 4 |
| Handler_update | 7089849963 |
| Handler_write | 45088211806 |
| Innodb_buffer_pool_dump_status | not started |
| Innodb_buffer_pool_load_status | not started |
| Innodb_buffer_pool_pages_data | 1376429 |
| Innodb_buffer_pool_bytes_data | 22551412736 |
| Innodb_buffer_pool_pages_dirty | 54 |
| Innodb_buffer_pool_bytes_dirty | 884736 |
| Innodb_buffer_pool_pages_flushed | 511407857 |
| Innodb_buffer_pool_pages_free | 8196 |
| Innodb_buffer_pool_pages_misc | 188235 |
| Innodb_buffer_pool_pages_total | 1572860 |
| Innodb_buffer_pool_read_ahead_rnd | 0 |
| Innodb_buffer_pool_read_ahead | 43857258 |
| Innodb_buffer_pool_read_ahead_evicted | 1057 |
| Innodb_buffer_pool_read_requests | 3228343017942 |
| Innodb_buffer_pool_reads | 244620532 |
| Innodb_buffer_pool_wait_free | 0 |
| Innodb_buffer_pool_write_requests | 89857610256 |
| Innodb_data_fsyncs | 252778445 |
| Innodb_data_pending_fsyncs | 0 |
| Innodb_data_pending_reads | 0 |
| Innodb_data_pending_writes | 0 |
| Innodb_data_read | 4856234004480 |
| Innodb_data_reads | 300490939 |
| Innodb_data_writes | 688782200 |
| Innodb_data_written | 20083187664896 |
| Innodb_dblwr_pages_written | 511407857 |
| Innodb_dblwr_writes | 80028802 |
| Innodb_have_atomic_builtins | ON |
| Innodb_log_waits | 83 |
| Innodb_log_write_requests | 6576724198 |
| Innodb_log_writes | 91699110 |
| Innodb_os_log_fsyncs | 92727019 |
| Innodb_os_log_pending_fsyncs | 0 |
| Innodb_os_log_pending_writes | 0 |
| Innodb_os_log_written | 3324697918464 |
| Innodb_page_size | 16384 |
| Innodb_pages_created | 131874643 |
| Innodb_pages_read | 296401190 |
| Innodb_pages_written | 511407857 |
| Innodb_row_lock_current_waits | 5 |
| Innodb_row_lock_time | 193034202 |
| Innodb_row_lock_time_avg | 607 |
| Innodb_row_lock_time_max | 121631 |
| Innodb_row_lock_waits | 317570 |
| Innodb_rows_deleted | 59374751 |
| Innodb_rows_inserted | 16430811966 |
| Innodb_rows_read | 2466321863882 |
| Innodb_rows_updated | 108284843 |
| Innodb_num_open_files | 4 |
| Innodb_truncated_status_writes | 0 |
| Innodb_available_undo_logs | 128 |
| Key_blocks_not_flushed | 0 |
| Key_blocks_unused | 857368 |
| Key_blocks_used | 6054 |
| Key_read_requests | 2074128816 |
| Key_reads | 0 |
| Key_write_requests | 292337091 |
| Key_writes | 0 |
| Last_query_cost | 0.000000 |
| Last_query_partial_plans | 0 |
| Max_used_connections | 157 |
| Not_flushed_delayed_rows | 0 |
| Open_files | 24 |
| Open_streams | 0 |
| Open_table_definitions | 352 |
| Open_tables | 528 |
| Opened_files | 437024702 |
| Opened_table_definitions | 1775335 |
| Opened_tables | 914419 |
| Prepared_stmt_count | 0 |
| Queries | 4907804173 |
| Questions | 4841613101 |
| Select_full_join | 46349 |
| Select_full_range_join | 82799 |
| Select_range | 266986250 |
| Select_range_check | 536 |
| Select_scan | 304460971 |
| Slow_launch_threads | 0 |
| Slow_queries | 144689 |
| Sort_merge_passes | 3055697 |
| Sort_range | 119144232 |
| Sort_rows | 30616561566 |
| Sort_scan | 115294967 |
| Table_locks_immediate | 3285742243 |
| Table_locks_waited | 0 |
| Table_open_cache_hits | 3445246883 |
| Table_open_cache_misses | 169145 |
| Table_open_cache_overflows | 0 |
| Threads_cached | 43 |
| Threads_connected | 64 |
| Threads_created | 278 |
| Threads_running | 2 |
| Uptime | 8670805 |
| Uptime_since_flush_status | 8670805 |
+-----------------------------------------------+----------------+
Innodb variables
+------------------------------------------+------------------------+
| Variable_name | Value |
+------------------------------------------+------------------------+
| innodb_adaptive_flushing | ON |
| innodb_adaptive_flushing_lwm | 10 |
| innodb_adaptive_hash_index | ON |
| innodb_adaptive_max_sleep_delay | 150000 |
| innodb_additional_mem_pool_size | 8388608 |
| innodb_api_bk_commit_interval | 5 |
| innodb_api_disable_rowlock | OFF |
| innodb_api_enable_binlog | OFF |
| innodb_api_enable_mdl | OFF |
| innodb_api_trx_level | 0 |
| innodb_autoextend_increment | 64 |
| innodb_autoinc_lock_mode | 1 |
| innodb_buffer_pool_dump_at_shutdown | OFF |
| innodb_buffer_pool_dump_now | OFF |
| innodb_buffer_pool_filename | <removed> |
| innodb_buffer_pool_instances | 8 |
| innodb_buffer_pool_load_abort | OFF |
| innodb_buffer_pool_load_at_startup | OFF |
| innodb_buffer_pool_load_now | OFF |
| innodb_buffer_pool_size | 25769803776 |
| innodb_change_buffer_max_size | 25 |
| innodb_change_buffering | all |
| innodb_checksum_algorithm | innodb |
| innodb_checksums | ON |
| innodb_cmp_per_index_enabled | OFF |
| innodb_commit_concurrency | 0 |
| innodb_compression_failure_threshold_pct | 5 |
| innodb_compression_level | 6 |
| innodb_compression_pad_pct_max | 50 |
| innodb_concurrency_tickets | 5000 |
| innodb_data_file_path | ibdata1:12M:autoextend |
| innodb_data_home_dir | <removed> |
| innodb_disable_sort_file_cache | OFF |
| innodb_doublewrite | ON |
| innodb_fast_shutdown | 1 |
| innodb_file_format | Antelope |
| innodb_file_format_check | ON |
| innodb_file_format_max | Antelope |
| innodb_file_per_table | OFF |
| innodb_flush_log_at_timeout | 1 |
| innodb_flush_log_at_trx_commit | 1 |
| innodb_flush_method | O_DIRECT |
| innodb_flush_neighbors | 1 |
| innodb_flushing_avg_loops | 30 |
| innodb_force_load_corrupted | OFF |
| innodb_force_recovery | 0 |
| innodb_ft_aux_table | |
| innodb_ft_cache_size | 8000000 |
| innodb_ft_enable_diag_print | OFF |
| innodb_ft_enable_stopword | ON |
| innodb_ft_max_token_size | 84 |
| innodb_ft_min_token_size | 3 |
| innodb_ft_num_word_optimize | 2000 |
| innodb_ft_result_cache_limit | 2000000000 |
| innodb_ft_server_stopword_table | |
| innodb_ft_sort_pll_degree | 2 |
| innodb_ft_total_cache_size | 640000000 |
| innodb_ft_user_stopword_table | |
| innodb_io_capacity | 200 |
| innodb_io_capacity_max | 2000 |
| innodb_large_prefix | OFF |
| innodb_lock_wait_timeout | 120 |
| innodb_locks_unsafe_for_binlog | OFF |
| innodb_log_buffer_size | 8388608 |
| innodb_log_compressed_pages | ON |
| innodb_log_file_size | 134217728 |
| innodb_log_files_in_group | 3 |
| innodb_log_group_home_dir | <removed> |
| innodb_lru_scan_depth | 1024 |
| innodb_max_dirty_pages_pct | 75 |
| innodb_max_dirty_pages_pct_lwm | 0 |
| innodb_max_purge_lag | 0 |
| innodb_max_purge_lag_delay | 0 |
| innodb_mirrored_log_groups | 1 |
| innodb_monitor_disable | |
| innodb_monitor_enable | |
| innodb_monitor_reset | |
| innodb_monitor_reset_all | |
| innodb_old_blocks_pct | 37 |
| innodb_old_blocks_time | 1000 |
| innodb_online_alter_log_max_size | 134217728 |
| innodb_open_files | 300 |
| innodb_optimize_fulltext_only | OFF |
| innodb_page_size | 16384 |
| innodb_print_all_deadlocks | OFF |
| innodb_purge_batch_size | 300 |
| innodb_purge_threads | 1 |
| innodb_random_read_ahead | OFF |
| innodb_read_ahead_threshold | 56 |
| innodb_read_io_threads | 4 |
| innodb_read_only | OFF |
| innodb_replication_delay | 0 |
| innodb_rollback_on_timeout | OFF |
| innodb_rollback_segments | 128 |
| innodb_sort_buffer_size | 1048576 |
| innodb_spin_wait_delay | 6 |
| innodb_stats_auto_recalc | ON |
| innodb_stats_method | nulls_equal |
| innodb_stats_on_metadata | OFF |
| innodb_stats_persistent | ON |
| innodb_stats_persistent_sample_pages | 20 |
| innodb_stats_sample_pages | 8 |
| innodb_stats_transient_sample_pages | 8 |
| innodb_status_output | OFF |
| innodb_status_output_locks | OFF |
| innodb_strict_mode | OFF |
| innodb_support_xa | ON |
| innodb_sync_array_size | 1 |
| innodb_sync_spin_loops | 30 |
| innodb_table_locks | ON |
| innodb_thread_concurrency | 0 |
| innodb_thread_sleep_delay | 10000 |
| innodb_undo_directory | . |
| innodb_undo_logs | 128 |
| innodb_undo_tablespaces | 0 |
| innodb_use_native_aio | ON |
| innodb_use_sys_malloc | ON |
| innodb_version | 5.6.24 |
| innodb_write_io_threads | 4 |
+------------------------------------------+------------------------+
Best Answer
You may need to tune your writes for InnoDB and possibly your ext4 volume
ASPECT #1
I noticed you have innodb_write_io_threads set to 4 (default value). You need to increase that so dirty pages can get flushed to their respective
.ibd
files more robustly. Please set it 16.ASPECT #2
That pause to perform write might be do to ext4. Why ?
You have innodb_flush_method set to
O_DIRECT
. This is supposed to make disk writes more stable. Nevertheless, I a surprise for you. I answered ib_logfile opened with O_SYNC when innodb_flush_method=O_DSYNC about a year ago. I mentioned a Percona blog that says that O_DIRECT is faked in ext4 with the latest kernel.