MySQL – Which AWS RDS instance? Current db.t3.medium is over its CPU baseline, but not sure what the next best option is

amazon-rds, amazon-rds-aurora, mysql

I currently have one RDS instance, a db.t3.medium, running Aurora MySQL.

Due to an expected increase in workload, the CPU is running, on average, just over its baseline, so in about 10 days it will run out of CPU credits and then start getting expensive. The 15-minute average CPU utilization is around 22% (the baseline is 20%).

RDS doesn't offer any larger db.t3 sizes for Aurora MySQL, so I either need to change instance class, move to Aurora Serverless, or add a second instance alongside. I'm not sure which is the best move, so I'm keen to get some advice.

Second Instance

Up until now I've avoided running a second db instance because:

  1. I don't fully understand how to set it up correctly to share the workload (adding separate read and write endpoints to my applications would be tricky)
  2. I'm not sure whether a second instance would actually share the load the way an EC2 load balancer does
  3. most of my MySQL CPU usage comes from reads (relatively few writes), so I don't expect splitting into a reader and a writer to help much (the reader would probably still run hot). But my understanding of how this all works is patchy.

I'm aware of the pitfalls of a single instance around availability/failover etc., so it would be nice to mitigate that as a bonus of this upgrade.
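From what I've read, Aurora may already solve point 1 for me: the cluster exposes a single reader endpoint that load-balances read connections across all replicas, so I wouldn't need per-instance endpoints in my applications. If I understand correctly, adding a reader would look something like this (the identifiers are placeholders for my actual cluster):

```shell
# Add an Aurora Replica to the existing cluster (it shares the cluster's
# storage volume, so no separate replication setup is needed)
aws rds create-db-instance \
    --db-instance-identifier my-aurora-reader-1 \
    --db-cluster-identifier my-aurora-cluster \
    --db-instance-class db.t3.medium \
    --engine aurora-mysql

# The cluster's reader endpoint round-robins connections across replicas
aws rds describe-db-clusters \
    --db-cluster-identifier my-aurora-cluster \
    --query 'DBClusters[0].ReaderEndpoint'
```

But as above, my understanding here is patchy, so corrections welcome.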

Switch Instance Type

If I switch instance types, a db.r5.large is twice the cost of a db.t3.medium but still only has 2 vCPUs. It has a lot more memory (16 GB vs 4 GB), but my issue is CPU.

I'm not sure how the ECU (Amazon's relative measure of processing power) compares on a db.r5.large vs a db.t3.medium, as the t3's is documented as 'Variable', so I'm not sure whether I'm comparing apples with oranges between the r5 and the t3.

  • db.r5.large ECU = 10
  • db.t3.medium ECU = Variable

The db.r5.xlarge has 4 vCPU but would quadruple my RDS costs.

Sidenote:
I have 9 months left on a t3.medium reserved instance, so I would lose that if I do anything other than set up a second t3 instance.

Serverless

Based on the fairly vague documentation around Serverless resources ("1 ACU has approximately 2 GB of memory with corresponding CPU and networking, similar to what is used in Aurora user-provisioned instances."), I would guesstimate that 2 ACUs are the equivalent of a db.t3.medium, so 4 ACUs would give me double my current CPU for roughly the same cost as a db.r5.large.

Serverless would throw in the added benefit of built-in availability and replication.

The RDS load is fairly steady rather than spiky, so I wouldn't benefit much from the times Serverless scales down to, say, 1 ACU.
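If I did go Serverless, my plan would be to pin the capacity to a narrow range, given the steady load; something like this (the cluster name is a placeholder):

```shell
# Constrain Aurora Serverless v1 to 2-4 ACUs and disable auto-pause,
# which suits a steady, non-spiky workload
aws rds modify-db-cluster \
    --db-cluster-identifier my-serverless-cluster \
    --scaling-configuration MinCapacity=2,MaxCapacity=4,AutoPause=false
```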

What's best?

What would you do, and why?!

Thanks!


Best Answer

Analysis of STATUS and VARIABLES:

Observations:

  • Version: 5.6.10
  • 4 GB of RAM
  • Uptime = 68d 06:45:29
  • You are not running on Windows.
  • Running 64-bit version
  • You appear to be running entirely (or mostly) InnoDB.

The More Important Issues:

innodb_buffer_pool_size could be increased, but not so much as to cause swapping. (If you don't have much data, then increasing it won't make any difference.)

If the disk is SSD, increase innodb_io_capacity to 500. (The I/O seems low, so this may not have any impact.)

max_allowed_packet is 12% of RAM; I suggest lowering it to 1% of RAM unless you need a huge packet size.
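Note that on RDS/Aurora you cannot edit my.cnf; these variables are changed through a DB (or cluster) parameter group. A sketch with a made-up group name (which group a given variable lives in depends on the parameter):

```shell
# Apply the suggested changes via the parameter group; dynamic parameters
# can use ApplyMethod=immediate, static ones need pending-reboot
aws rds modify-db-parameter-group \
    --db-parameter-group-name my-aurora-params \
    --parameters \
      "ParameterName=max_allowed_packet,ParameterValue=16777216,ApplyMethod=immediate" \
      "ParameterName=innodb_io_capacity,ParameterValue=500,ApplyMethod=immediate"
```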

Some things indicate very little I/O, as most of the work is in the CPU. This further pushes toward studying the slow queries.

The Query cache (query_cache_type) is ON. And there is some indication of heavy activity with it. This may be adding to the CPU for "pruning" or it may be helping with the CPU. (I can't tell which.) Suggest you change to DEMAND; at the same time, add SQL_CACHE to queries that are likely to benefit from the QC and SQL_NO_CACHE to the others. This goes right after SELECT. These changes will make the QC more focused, thereby (hopefully) decreasing the CPU usage on both sides.
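To illustrate the hints (the table and column names here are invented):

```sql
-- With query_cache_type = DEMAND, only queries marked SQL_CACHE
-- are stored in the query cache:
SELECT SQL_CACHE name, price
    FROM products
    WHERE category = 'books';

-- Volatile or one-off queries should bypass the cache entirely:
SELECT SQL_NO_CACHE COUNT(*)
    FROM orders
    WHERE created_at > NOW() - INTERVAL 5 MINUTE;
```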

If you turn on the slowlog (which I recommend), be sure to lower long_query_time to, say, 1.
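Since log_output = TABLE on this server, slowlog entries will land in the mysql.slow_log table rather than a file; you can then pull the worst offenders with something like:

```sql
-- Find the ten slowest logged queries to target for optimization
SELECT start_time, query_time, rows_examined, sql_text
    FROM mysql.slow_log
    ORDER BY query_time DESC
    LIMIT 10;
```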

Details and other observations:

( Key_blocks_used * 1024 / key_buffer_size ) = 22 * 1024 / 16M = 0.13% -- Percent of key_buffer used. High-water-mark. -- Lower key_buffer_size (now 16777216) to avoid unnecessary memory usage.

( (key_buffer_size / 0.20 + innodb_buffer_pool_size / 0.70) ) = ((16M / 0.20 + 1513M / 0.70)) / 4096M = 54.7% -- Most of available ram should be made available for caching. -- http://mysql.rjweb.org/doc.php/memory

( innodb_lru_scan_depth ) = 1,024 -- "InnoDB: page_cleaner: 1000ms intended loop took ..." may be fixed by lowering lru_scan_depth

( innodb_io_capacity_max / innodb_io_capacity ) = 2,000 / 200 = 10 -- Capacity: max/plain -- Recommend 2. Max should be about equal to the IOPs your I/O subsystem can handle. (If the drive type is unknown 2000/200 may be a reasonable pair.)

( innodb_change_buffering ) = innodb_change_buffering = none -- Pre-5.6.11 / 5.5.31, there was a bug that made ="changes" a safer option.

( innodb_doublewrite ) = innodb_doublewrite = OFF -- Extra I/O, but extra safety in crash. -- OFF is OK for FusionIO, Galera, Replicas, ZFS.

( Innodb_os_log_written ) = 512 / 5899529 = 0.31 /HR -- This is an indicator of how busy InnoDB is. -- Very idle InnoDB.

( Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group / innodb_log_file_size ) = 512 / (5899529 / 3600) / 2 / 48M = 3.1e-9 -- Ratio -- (see minutes)

( Uptime / 60 * innodb_log_file_size / Innodb_os_log_written ) = 5,899,529 / 60 * 48M / 512 = 9.67e+9 -- Minutes between InnoDB log rotations Beginning with 5.6.8, this can be changed dynamically; be sure to also change my.cnf. -- (The recommendation of 60 minutes between rotations is somewhat arbitrary.) Adjust innodb_log_file_size (now 50331648). (Cannot change in AWS.)

( Handler_rollback ) = 68,089,346 / 5899529 = 12 /sec -- Why so many rollbacks?

( innodb_flush_neighbors ) = 1 -- A minor optimization when writing blocks to disk. -- Use 0 for SSD drives; 1 for HDD.

( innodb_io_capacity ) = 200 -- I/O ops per second the disk can handle. 100 for slow drives; 200 for spinning drives; 1000-2000 for SSDs; multiply by RAID factor.

( innodb_adaptive_hash_index ) = innodb_adaptive_hash_index = OFF -- Usually should be ON. -- There are cases where OFF is better. See also innodb_adaptive_hash_index_parts (after 5.7.9) and innodb_adaptive_hash_index_partitions (MariaDB and Percona). ON has been implicated in rare crashes (bug 73890).

( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.

( max_allowed_packet ) = 512M / 4096M = 12.5% -- If you do not have large blobs (etc) to load, then decrease the value. Else decrease innodb_buffer_pool_size (now 1586495488) to make room. Swapping is terrible for performance.

( innodb_ft_result_cache_limit ) = 2,000,000,000 / 4096M = 46.6% -- Byte limit on FULLTEXT resultset. (Possibly not preallocated, but grows?) -- Lower the setting.

( character_set_server ) = character_set_server = latin1 -- Charset problems may be helped by setting character_set_server (now latin1) to utf8mb4. That is the future default.

( local_infile ) = local_infile = ON -- local_infile (now ON) = ON is a potential security issue

( Qcache_lowmem_prunes ) = 142,717,868 / 5899529 = 24 /sec -- Running out of room in QC -- increase query_cache_size (now 88158208)

( Qcache_lowmem_prunes/Qcache_inserts ) = 142,717,868/210841277 = 67.7% -- Removal Ratio (frequency of needing to prune due to not enough memory)

( Qcache_not_cached ) = 624,955,002 / 5899529 = 105 /sec -- SQL_CACHE attempted, but ignored -- Rethink caching; tune qcache

( Qcache_not_cached / (Qcache_hits + Com_select + Qcache_not_cached) ) = 624,955,002 / (608264262 + 835804146 + 624955002) = 30.2% -- Percent of SELECTs that were not cached in the QC. -- QC is not very useful.

( (query_cache_size - Qcache_free_memory) / Qcache_queries_in_cache / query_alloc_block_size ) = (88158208 - 1227312) / 34276 / 8192 = 0.31 -- query_alloc_block_size vs formula -- Adjust query_alloc_block_size (now 8192)

( Select_scan ) = 171,787,477 / 5899529 = 29 /sec -- full table scans -- Add indexes / optimize queries (unless they are tiny tables)

( Select_scan / Com_select ) = 171,787,477 / 835804146 = 20.6% -- % of selects doing full table scan. (May be fooled by Stored Routines.) -- Add indexes / optimize queries
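A quick way to confirm a full scan and fix it (the example table and column are hypothetical):

```sql
-- type = ALL in the EXPLAIN output means a full table scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Adding an index on the filter column lets the lookup use type = ref
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);
```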

( relay_log_space_limit ) = 1,000,000,000 = 953.7MB -- The max total size for relay logs on a Replica. (0=unlimited) -- Let's discuss the rationale for having a limit.

( slow_query_log ) = slow_query_log = OFF -- Whether to log slow queries. (5.1.12)

( long_query_time ) = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2

Abnormally small:

( Innodb_pages_read + Innodb_pages_written ) / Uptime = 0.0118
Innodb_buffer_pool_bytes_data = 262 /sec
Innodb_buffer_pool_pages_flushed / max(Questions, Queries) = 0
Innodb_buffer_pool_pages_misc = 0
Innodb_buffer_pool_pages_misc * 16384 / innodb_buffer_pool_size = 0
Innodb_data_fsyncs = 0
Innodb_data_read = 192 /sec
Innodb_data_reads = 1.3 /HR
Innodb_data_writes = 1.3 /HR
Innodb_data_writes - Innodb_log_writes - Innodb_dblwr_writes = 1.3 /HR
Innodb_data_written = 0
Innodb_dblwr_pages_written = 0
Innodb_log_write_requests = 0
Innodb_os_log_fsyncs = 0
Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group = 0.0MB
Innodb_pages_read = 42 /HR
Innodb_pages_read + Innodb_pages_written = 42 /HR
Innodb_pages_written = 0
host_cache_size = 128

Abnormally large:

(query_cache_size - Qcache_free_memory) / query_cache_size = 98.6%
1 - Qcache_free_memory / query_cache_size = 98.6%
Com_create_trigger = 0.00061 /HR
Com_drop_trigger = 0.00061 /HR
Com_empty_query = 0.06 /HR
Com_flush = 21 /HR
Com_purge_before_date = 12 /HR
Com_rename_user = 0.00061 /HR
Com_revoke = 0.0012 /HR
Com_show_tables = 1.6 /sec
Handler_read_last = 1.6 /sec
Innodb_buffer_pool_read_requests / (Innodb_buffer_pool_read_requests + Innodb_buffer_pool_reads ) = 100.0%
Qcache_total_blocks = 118,191
Qcache_total_blocks * query_cache_min_res_unit / Qcache_queries_in_cache = 14,123
Ssl_accepts = 172
Ssl_default_timeout = 7,200
Ssl_finished_accepts = 166
Ssl_session_cache_misses = 172
Ssl_verify_depth = 1.84e+19
Ssl_verify_mode = 5
back_log / max_connections = 75.6%
innodb_stats_persistent_sample_pages = 128
table_definition_cache = 5,383

Abnormal strings:

core_file = ON
ft_boolean_syntax = + -><()~*:\"\"&
innodb_checksums = OFF
innodb_fast_shutdown = 1
innodb_use_native_aio = OFF
log_output = TABLE
optimizer_trace = enabled=off,one_line=off
optimizer_trace_features = greedy_search=on, range_optimizer=on, dynamic_range=on, repeated_subselect=on
relay_log_recovery = ON
slave_rows_search_algorithms = TABLE_SCAN,INDEX_SCAN
thread_handling = multiple-connections-per-thread
time_zone = Pacific/Auckland