Mysql – Sudden MySQL freeze, with spike in dirty pages, but without obvious write queries

buffer-pooldata-pagesinnodbMySQLmysql-5.6

I'm systematically (1/2 times per day) experiencing a freeze in the DB server.

Suddenly, all the COMMITs get stuck in 'init' state; they pile up (up to 20/30), an almost no other queries are executed at all.
After ~30 seconds, the server resumes normal operations.

I've mined a lot of data (which follows), which shows that something is going on in InnoDB, but I struggle to find a specific cause.

In the processlist:

  • during the problem I see only COMMITs
  • before, there are no long-running operations

I've also gone through the binary log, but there is nothing suspicious:

  • during the problem, there is nothing being writte
  • before, only small queries

As a matter of fact, the binary log starting from shortly before the problem is rather small – being in ROW mode, if there would have been a huge update operation, I would expect it to be large.

I considered a problem with the disk, but the data doesn't fit the pattern of, for example, a sudden disk slowdown.

Which could be the cause? From this type of problem, I would expect a large write query, but I can't find any.

The system is a dedicated server, with 128GB of memory and Dual CPU Intel Xeon E5-2680 v2 @ 2.80GHz.
The storage is a mirror of Seagate 15k disks; the O/S is Ubuntu 14.04, and the files are stored in a separate ext4 partition.

MySQL is v5.6.24, with an allocated buffer pool of 24G.

The query cache is disabled.

Charts, global status and innodb variables follow, with some notes.

In all the charts, each X unit is a second. The problem approximate interval is highlighted by the vertical bars.

Note that I have all the processes, and global status variables, and innodb statuses, archived, for each second of the problem.

Innodb buffer pool pages

There are significant spikes in the dirty/free pages:

InnoDB dirty/free pages

And also in the buffer pool size:

InnoDB buffer pool size

The pages flushed have a very interesting behavior; the don't spike a lot, but InnoDB starts a continuous flushed; after the criticality, it skyrockets:

InnoDB page flushes

Innodb_buffer_pool_wait_free is always 0.

Log file

The is an increase in the data written to the log file, although not huge:

Log file writes

Disk operations

There are no significant change in pending operations (fsyncs/read/writes), but is a drop in the actual fsyncs:

Innodb fsyncs

Commands

Almost all the commands (Com_*) don't execute during the problem.
The following are the only ones executing, which clearly plunged:

Selects
Commits
Inserts
Updates

Current global status (partial)

+-----------------------------------------------+----------------+
| Aborted_clients | 4767955 |
| Aborted_connects | 33 |
| Binlog_cache_disk_use | 47747 |
| Binlog_cache_use | 68002226 |
| Binlog_stmt_cache_disk_use | 0 |
| Binlog_stmt_cache_use | 682187 |
| Bytes_received | 810930341088 |
| Bytes_sent | 6128156018408 |
| Com_admin_commands | 66190088 |
| Com_alter_table | 5837 |
| Com_analyze | 9742 |
| Com_begin | 328167299 |
| Com_change_db | 216 |
| Com_commit | 327839739 |
| Com_create_db | 13 |
| Com_create_index | 1 |
| Com_create_table | 877609 |
| Com_delete | 13720394 |
| Com_delete_multi | 89347 |
| Com_drop_db | 14 |
| Com_drop_table | 740674 |
| Com_insert | 130474570 |
| Com_insert_select | 572148 |
| Com_kill | 98 |
| Com_load | 1 |
| Com_lock_tables | 3 |
| Com_release_savepoint | 28061 |
| Com_rename_table | 978 |
| Com_replace | 7995547 |
| Com_replace_select | 46493 |
| Com_rollback | 284617 |
| Com_rollback_to_savepoint | 3 |
| Com_savepoint | 28062 |
| Com_select | 1664064423 |
| Com_set_option | 152369114 |
| Com_show_create_table | 64205947 |
| Com_show_databases | 11 |
| Com_show_engine_status | 1375041 |
| Com_show_fields | 97123409 |
| Com_show_keys | 1 |
| Com_show_master_status | 66 |
| Com_show_processlist | 2624788 |
| Com_show_slave_status | 1250533 |
| Com_show_status | 1591300 |
| Com_show_table_status | 198 |
| Com_show_tables | 90403059 |
| Com_show_triggers | 197 |
| Com_show_variables | 1580763 |
| Com_show_warnings | 3 |
| Com_unlock_tables | 4 |
| Com_update | 39673285 |
| Com_update_multi | 1259352 |
| Compression | OFF |
| Connections | 5426115 |
| Created_tmp_disk_tables | 108340111 |
| Created_tmp_files | 633100 |
| Created_tmp_tables | 350818717 |
| Delayed_errors | 0 |
| Delayed_insert_threads | 0 |
| Delayed_writes | 0 |
| Flush_commands | 1 |
| Handler_commit | 2607344684 |
| Handler_delete | 59374751 |
| Handler_discover | 0 |
| Handler_external_lock | 6579792721 |
| Handler_mrr_init | 0 |
| Handler_prepare | 379741472 |
| Handler_read_first | 122813781 |
| Handler_read_key | 300828885004 |
| Handler_read_last | 68546 |
| Handler_read_next | 1973974492391 |
| Handler_read_prev | 535277275 |
| Handler_read_rnd | 25166576467 |
| Handler_read_rnd_next | 403233196007 |
| Handler_rollback | 7441860 |
| Handler_savepoint | 56123 |
| Handler_savepoint_rollback | 4 |
| Handler_update | 7089849963 |
| Handler_write | 45088211806 |
| Innodb_buffer_pool_dump_status | not started |
| Innodb_buffer_pool_load_status | not started |
| Innodb_buffer_pool_pages_data | 1376429 |
| Innodb_buffer_pool_bytes_data | 22551412736 |
| Innodb_buffer_pool_pages_dirty | 54 |
| Innodb_buffer_pool_bytes_dirty | 884736 |
| Innodb_buffer_pool_pages_flushed | 511407857 |
| Innodb_buffer_pool_pages_free | 8196 |
| Innodb_buffer_pool_pages_misc | 188235 |
| Innodb_buffer_pool_pages_total | 1572860 |
| Innodb_buffer_pool_read_ahead_rnd | 0 |
| Innodb_buffer_pool_read_ahead | 43857258 |
| Innodb_buffer_pool_read_ahead_evicted | 1057 |
| Innodb_buffer_pool_read_requests | 3228343017942 |
| Innodb_buffer_pool_reads | 244620532 |
| Innodb_buffer_pool_wait_free | 0 |
| Innodb_buffer_pool_write_requests | 89857610256 |
| Innodb_data_fsyncs | 252778445 |
| Innodb_data_pending_fsyncs | 0 |
| Innodb_data_pending_reads | 0 |
| Innodb_data_pending_writes | 0 |
| Innodb_data_read | 4856234004480 |
| Innodb_data_reads | 300490939 |
| Innodb_data_writes | 688782200 |
| Innodb_data_written | 20083187664896 |
| Innodb_dblwr_pages_written | 511407857 |
| Innodb_dblwr_writes | 80028802 |
| Innodb_have_atomic_builtins | ON |
| Innodb_log_waits | 83 |
| Innodb_log_write_requests | 6576724198 |
| Innodb_log_writes | 91699110 |
| Innodb_os_log_fsyncs | 92727019 |
| Innodb_os_log_pending_fsyncs | 0 |
| Innodb_os_log_pending_writes | 0 |
| Innodb_os_log_written | 3324697918464 |
| Innodb_page_size | 16384 |
| Innodb_pages_created | 131874643 |
| Innodb_pages_read | 296401190 |
| Innodb_pages_written | 511407857 |
| Innodb_row_lock_current_waits | 5 |
| Innodb_row_lock_time | 193034202 |
| Innodb_row_lock_time_avg | 607 |
| Innodb_row_lock_time_max | 121631 |
| Innodb_row_lock_waits | 317570 |
| Innodb_rows_deleted | 59374751 |
| Innodb_rows_inserted | 16430811966 |
| Innodb_rows_read | 2466321863882 |
| Innodb_rows_updated | 108284843 |
| Innodb_num_open_files | 4 |
| Innodb_truncated_status_writes | 0 |
| Innodb_available_undo_logs | 128 |
| Key_blocks_not_flushed | 0 |
| Key_blocks_unused | 857368 |
| Key_blocks_used | 6054 |
| Key_read_requests | 2074128816 |
| Key_reads | 0 |
| Key_write_requests | 292337091 |
| Key_writes | 0 |
| Last_query_cost | 0.000000 |
| Last_query_partial_plans | 0 |
| Max_used_connections | 157 |
| Not_flushed_delayed_rows | 0 |
| Open_files | 24 |
| Open_streams | 0 |
| Open_table_definitions | 352 |
| Open_tables | 528 |
| Opened_files | 437024702 |
| Opened_table_definitions | 1775335 |
| Opened_tables | 914419 |
| Prepared_stmt_count | 0 |
| Queries | 4907804173 |
| Questions | 4841613101 |
| Select_full_join | 46349 |
| Select_full_range_join | 82799 |
| Select_range | 266986250 |
| Select_range_check | 536 |
| Select_scan | 304460971 |
| Slow_launch_threads | 0 |
| Slow_queries | 144689 |
| Sort_merge_passes | 3055697 |
| Sort_range | 119144232 |
| Sort_rows | 30616561566 |
| Sort_scan | 115294967 |
| Table_locks_immediate | 3285742243 |
| Table_locks_waited | 0 |
| Table_open_cache_hits | 3445246883 |
| Table_open_cache_misses | 169145 |
| Table_open_cache_overflows | 0 |
| Threads_cached | 43 |
| Threads_connected | 64 |
| Threads_created | 278 |
| Threads_running | 2 |
| Uptime | 8670805 |
| Uptime_since_flush_status | 8670805 |
+-----------------------------------------------+----------------+

Innodb variables

+------------------------------------------+------------------------+
| Variable_name | Value |
+------------------------------------------+------------------------+
| innodb_adaptive_flushing | ON |
| innodb_adaptive_flushing_lwm | 10 |
| innodb_adaptive_hash_index | ON |
| innodb_adaptive_max_sleep_delay | 150000 |
| innodb_additional_mem_pool_size | 8388608 |
| innodb_api_bk_commit_interval | 5 |
| innodb_api_disable_rowlock | OFF |
| innodb_api_enable_binlog | OFF |
| innodb_api_enable_mdl | OFF |
| innodb_api_trx_level | 0 |
| innodb_autoextend_increment | 64 |
| innodb_autoinc_lock_mode | 1 |
| innodb_buffer_pool_dump_at_shutdown | OFF |
| innodb_buffer_pool_dump_now | OFF |
| innodb_buffer_pool_filename | <removed> |
| innodb_buffer_pool_instances | 8 |
| innodb_buffer_pool_load_abort | OFF |
| innodb_buffer_pool_load_at_startup | OFF |
| innodb_buffer_pool_load_now | OFF |
| innodb_buffer_pool_size | 25769803776 |
| innodb_change_buffer_max_size | 25 |
| innodb_change_buffering | all |
| innodb_checksum_algorithm | innodb |
| innodb_checksums | ON |
| innodb_cmp_per_index_enabled | OFF |
| innodb_commit_concurrency | 0 |
| innodb_compression_failure_threshold_pct | 5 |
| innodb_compression_level | 6 |
| innodb_compression_pad_pct_max | 50 |
| innodb_concurrency_tickets | 5000 |
| innodb_data_file_path | ibdata1:12M:autoextend |
| innodb_data_home_dir | <removed> |
| innodb_disable_sort_file_cache | OFF |
| innodb_doublewrite | ON |
| innodb_fast_shutdown | 1 |
| innodb_file_format | Antelope |
| innodb_file_format_check | ON |
| innodb_file_format_max | Antelope |
| innodb_file_per_table | OFF |
| innodb_flush_log_at_timeout | 1 |
| innodb_flush_log_at_trx_commit | 1 |
| innodb_flush_method | O_DIRECT |
| innodb_flush_neighbors | 1 |
| innodb_flushing_avg_loops | 30 |
| innodb_force_load_corrupted | OFF |
| innodb_force_recovery | 0 |
| innodb_ft_aux_table | |
| innodb_ft_cache_size | 8000000 |
| innodb_ft_enable_diag_print | OFF |
| innodb_ft_enable_stopword | ON |
| innodb_ft_max_token_size | 84 |
| innodb_ft_min_token_size | 3 |
| innodb_ft_num_word_optimize | 2000 |
| innodb_ft_result_cache_limit | 2000000000 |
| innodb_ft_server_stopword_table | |
| innodb_ft_sort_pll_degree | 2 |
| innodb_ft_total_cache_size | 640000000 |
| innodb_ft_user_stopword_table | |
| innodb_io_capacity | 200 |
| innodb_io_capacity_max | 2000 |
| innodb_large_prefix | OFF |
| innodb_lock_wait_timeout | 120 |
| innodb_locks_unsafe_for_binlog | OFF |
| innodb_log_buffer_size | 8388608 |
| innodb_log_compressed_pages | ON |
| innodb_log_file_size | 134217728 |
| innodb_log_files_in_group | 3 |
| innodb_log_group_home_dir | <removed> |
| innodb_lru_scan_depth | 1024 |
| innodb_max_dirty_pages_pct | 75 |
| innodb_max_dirty_pages_pct_lwm | 0 |
| innodb_max_purge_lag | 0 |
| innodb_max_purge_lag_delay | 0 |
| innodb_mirrored_log_groups | 1 |
| innodb_monitor_disable | |
| innodb_monitor_enable | |
| innodb_monitor_reset | |
| innodb_monitor_reset_all | |
| innodb_old_blocks_pct | 37 |
| innodb_old_blocks_time | 1000 |
| innodb_online_alter_log_max_size | 134217728 |
| innodb_open_files | 300 |
| innodb_optimize_fulltext_only | OFF |
| innodb_page_size | 16384 |
| innodb_print_all_deadlocks | OFF |
| innodb_purge_batch_size | 300 |
| innodb_purge_threads | 1 |
| innodb_random_read_ahead | OFF |
| innodb_read_ahead_threshold | 56 |
| innodb_read_io_threads | 4 |
| innodb_read_only | OFF |
| innodb_replication_delay | 0 |
| innodb_rollback_on_timeout | OFF |
| innodb_rollback_segments | 128 |
| innodb_sort_buffer_size | 1048576 |
| innodb_spin_wait_delay | 6 |
| innodb_stats_auto_recalc | ON |
| innodb_stats_method | nulls_equal |
| innodb_stats_on_metadata | OFF |
| innodb_stats_persistent | ON |
| innodb_stats_persistent_sample_pages | 20 |
| innodb_stats_sample_pages | 8 |
| innodb_stats_transient_sample_pages | 8 |
| innodb_status_output | OFF |
| innodb_status_output_locks | OFF |
| innodb_strict_mode | OFF |
| innodb_support_xa | ON |
| innodb_sync_array_size | 1 |
| innodb_sync_spin_loops | 30 |
| innodb_table_locks | ON |
| innodb_thread_concurrency | 0 |
| innodb_thread_sleep_delay | 10000 |
| innodb_undo_directory | . |
| innodb_undo_logs | 128 |
| innodb_undo_tablespaces | 0 |
| innodb_use_native_aio | ON |
| innodb_use_sys_malloc | ON |
| innodb_version | 5.6.24 |
| innodb_write_io_threads | 4 |
+------------------------------------------+------------------------+

Best Answer

You may need to tune your writes for InnoDB and possibly your ext4 volume

ASPECT #1

I noticed you have innodb_write_io_threads set to 4 (default value). You need to increase that so dirty pages can get flushed to their respective .ibd files more robustly. Please set it 16.

ASPECT #2

That pause to perform write might be do to ext4. Why ?

You have innodb_flush_method set to O_DIRECT. This is supposed to make disk writes more stable. Nevertheless, I a surprise for you. I answered ib_logfile opened with O_SYNC when innodb_flush_method=O_DSYNC about a year ago. I mentioned a Percona blog that says that O_DIRECT is faked in ext4 with the latest kernel.

  1. You need to make sure you have the latest Linux kernel (3.2 or greater)
  2. Change innodb_flush_log_at_trx_commit to 0