I currently have one RDS instance, `db.t3.medium`, running Aurora MySQL.
Due to an expected increase in workload, the CPU is on average running just over its baseline, so in about 10 days it will exhaust its CPU credits and then start getting expensive. The 15-minute average CPU utilization is around 22% (the baseline is 20%).
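For context, a rough burn-down estimate. This assumes standard t3 credit mechanics — about 24 credits earned per hour (the 20% baseline across 2 vCPUs) and a maximum bank of 24 hours' earnings, i.e. 576 credits — starting from a full bank; these figures are my assumptions, so check your own CloudWatch `CPUCreditBalance`:

```sql
-- Rough CPU-credit burn-down; the rates are assumptions, not account data.
-- Earn: 24 credits/hour. Spend at 22% average on 2 vCPUs: 0.22 * 2 * 60 = 26.4/hour.
SELECT 24 * 24                       AS max_credit_bank,  -- 576 credits
       0.22 * 2 * 60                 AS spend_per_hour,   -- 26.4 credits
       (24 * 24) / (26.4 - 24)       AS hours_to_empty,   -- 240 hours
       (24 * 24) / (26.4 - 24) / 24  AS days_to_empty;    -- 10 days
```

A net drain of 2.4 credits/hour against a 576-credit bank gives the "about 10 days" figure above.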
RDS doesn't support any larger `db.t3` sizes for Aurora MySQL, so I either need to change instance class, move to Aurora Serverless, or run a second instance alongside. I'm not sure what the best move is, so I'm keen to get some advice.
Second Instance
Up until now I've avoided running a second db instance because:
- I don't fully understand how to set it up correctly to share the workload (adding different endpoints for reads vs writes in my applications would be tricky)
- I'm not sure whether this would actually share the load the way an EC2 load balancer does
- most of my MySQL CPU usage comes from reads (relatively few writes), so I don't expect splitting into a reader and a writer to help much (the reader would probably still run hot). But my understanding of how this all works is patchy.
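For what it's worth, with an Aurora replica you would point read traffic at the cluster's reader endpoint and writes at the cluster (writer) endpoint; Aurora balances connections across readers at the endpoint/DNS level, not per-query like an ELB. A connection can check which kind of node it landed on — a sketch, relying on the fact that Aurora readers report `innodb_read_only = ON` (my understanding of current Aurora behavior; verify against the docs):

```sql
-- Run on an established connection to see whether it reached a reader.
-- 1 = read-only replica (reader endpoint), 0 = the writer.
SELECT @@innodb_read_only AS is_reader;
```

This kind of check is useful for confirming your application's read traffic really is landing on the replica.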
I'm aware of the pitfalls of a single instance around availability/failover etc., so it would be nice to mitigate those as a bonus from this upgrade.
Switch Instance Type
If I switch instance types, a `db.r5.large` is twice the cost of a `db.t3.medium` but still has only 2 vCPUs. It has a lot more memory (16 GB vs 4 GB), but my issue is CPU. I'm not sure how the ECU (the actual relative measure of processing power) compares between `db.r5.large` and `db.t3.medium`, as the t3's is documented as 'variable', so I don't know whether I'm comparing apples with oranges between the r5 and the t3.
- `db.r5.large` ECU = 10
- `db.t3.medium` ECU = Variable
The `db.r5.xlarge` has 4 vCPUs but would quadruple my RDS costs.
Sidenote:
I have 9 months left on a `t3.medium` reserved instance, so I would lose that if I do anything other than set up a second `t3` instance.
Serverless
Based on the fairly vague documentation around serverless resources ("1 ACU has approximately 2 GB of memory with corresponding CPU and networking, similar to what is used in Aurora user-provisioned instances."), I would guesstimate that 2 ACUs are the equivalent of a `db.t3.medium`, so 4 ACUs would give me double my current CPU for roughly the same cost as a `db.r5.large`.
Serverless would throw in the added benefit of built-in availability and replication.
The RDS load is fairly steady rather than spiky, so I wouldn't benefit much from Serverless scaling down to 1 ACU at quiet times, for example.
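A rough cost sanity check for the "roughly the same cost" claim, using assumed on-demand prices of about $0.06 per ACU-hour for Aurora Serverless and about $0.29/hour for a `db.r5.large` — both numbers are my assumptions and vary by region, so check the current pricing pages before deciding:

```sql
-- Approximate monthly cost at a steady load, 730 hours/month.
-- Prices are assumptions; substitute your region's actual rates.
SELECT 4 * 0.06 * 730 AS serverless_4acu_monthly,  -- ~175
       0.29 * 730     AS r5_large_monthly;         -- ~212
```

At a steady 4 ACUs the two options do come out in the same ballpark, which is why the comparison above is close.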
What's best?
What would you do, and why?!
Thanks!
Additional Info
- Any SSD or NVMe devices on the MySQL host server? Not sure – it's a standard AWS RDS instance, and I don't know what storage devices it uses by default.
- SHOW GLOBAL STATUS; https://pastebin.com/xkTaPKKF
- SHOW GLOBAL VARIABLES; https://pastebin.com/CGjeGJCv
- SHOW FULL PROCESSLIST; https://pastebin.com/T8YveG5q
- SHOW ENGINE INNODB STATUS; https://pastebin.com/EkQpyqui
- SELECT name, count FROM information_schema.innodb_metrics ORDER BY name; https://pastebin.com/kemMrNpS
- MySQL error log – last 12 hours, all of which was generated during a restart of the server: https://pastebin.com/A6sV0Qxc
Best Answer
Analysis of STATUS and VARIABLES:
Observations:
The More Important Issues:
- `innodb_buffer_pool_size` could be increased, but not so much as to cause swapping. (If you don't have much data, then increasing it won't make any difference.)
- If the disk is SSD, increase `innodb_io_capacity` to 500. (The I/O seems low, so this may not have any impact.)
- `max_allowed_packet` is 12% of RAM; suggest lowering it to 1% of RAM unless you need a huge packet size.
- Some things indicate very little I/O; most of the work is in the CPU. This further pushes toward studying the slow queries.
- The Query cache (`query_cache_type`) is ON, and there is some indication of heavy activity with it. This may be adding to the CPU for "pruning", or it may be helping with the CPU. (I can't tell which.) Suggest you change to DEMAND; at the same time, add SQL_CACHE to queries that are likely to benefit from the QC and SQL_NO_CACHE to the others. This goes right after SELECT. These changes will make the QC more focused, thereby (hopefully) decreasing the CPU usage on both sides.
- If you turn on the slowlog (which I recommend), be sure to lower `long_query_time` to, say, 1.

Details and other observations:
- `( Key_blocks_used * 1024 / key_buffer_size ) = 22 * 1024 / 16M = 0.13%` -- Percent of key_buffer used. High-water-mark. -- Lower key_buffer_size (now 16777216) to avoid unnecessary memory usage.
- `( (key_buffer_size / 0.20 + innodb_buffer_pool_size / 0.70) ) = ((16M / 0.20 + 1513M / 0.70)) / 4096M = 54.7%` -- Most of available RAM should be made available for caching. -- http://mysql.rjweb.org/doc.php/memory
- `( innodb_lru_scan_depth ) = 1,024` -- "InnoDB: page_cleaner: 1000ms intended loop took ..." may be fixed by lowering lru_scan_depth
- `( innodb_io_capacity_max / innodb_io_capacity ) = 2,000 / 200 = 10` -- Capacity: max/plain -- Recommend 2. Max should be about equal to the IOPs your I/O subsystem can handle. (If the drive type is unknown, 2000/200 may be a reasonable pair.)
- `( innodb_change_buffering ) = none` -- Pre-5.6.11 / 5.5.31, there was a bug that made ="changes" a safer option.
- `( innodb_doublewrite ) = OFF` -- Extra I/O, but extra safety in crash. -- OFF is OK for FusionIO, Galera, Replicas, ZFS.
- `( Innodb_os_log_written ) = 512 / 5899529 = 0.31 /HR` -- This is an indicator of how busy InnoDB is. -- Very idle InnoDB.
- `( Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group / innodb_log_file_size ) = 512 / (5899529 / 3600) / 2 / 48M = 3.1e-9` -- Ratio -- (see minutes)
- `( Uptime / 60 * innodb_log_file_size / Innodb_os_log_written ) = 5,899,529 / 60 * 48M / 512 = 9.67e+9` -- Minutes between InnoDB log rotations. Beginning with 5.6.8, this can be changed dynamically; be sure to also change my.cnf. -- (The recommendation of 60 minutes between rotations is somewhat arbitrary.) Adjust innodb_log_file_size (now 50331648). (Cannot change in AWS.)
- `( Handler_rollback ) = 68,089,346 / 5899529 = 12 /sec` -- Why so many rollbacks?
- `( innodb_flush_neighbors ) = 1` -- A minor optimization when writing blocks to disk. -- Use 0 for SSD drives; 1 for HDD.
- `( innodb_io_capacity ) = 200` -- I/O ops per second capable on disk. 100 for slow drives; 200 for spinning drives; 1000-2000 for SSDs; multiply by RAID factor.
- `( innodb_adaptive_hash_index ) = OFF` -- Usually should be ON. -- There are cases where OFF is better. See also innodb_adaptive_hash_index_parts (after 5.7.9) and innodb_adaptive_hash_index_partitions (MariaDB and Percona). ON has been implicated in rare crashes (bug 73890).
- `( innodb_print_all_deadlocks ) = OFF` -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.
- `( max_allowed_packet ) = 512M / 4096M = 12.5%` -- If you do not have large blobs (etc) to load, then decrease the value. Else decrease innodb_buffer_pool_size (now 1586495488) to make room. Swapping is terrible for performance.
- `( innodb_ft_result_cache_limit ) = 2,000,000,000 / 4096M = 46.6%` -- Byte limit on FULLTEXT resultset. (Possibly not preallocated, but grows?) -- Lower the setting.
- `( character_set_server ) = latin1` -- Charset problems may be helped by setting character_set_server (now latin1) to utf8mb4. That is the future default.
- `( local_infile ) = ON` -- local_infile (now ON) = ON is a potential security issue
- `( Qcache_lowmem_prunes ) = 142,717,868 / 5899529 = 24 /sec` -- Running out of room in QC -- increase query_cache_size (now 88158208)
- `( Qcache_lowmem_prunes / Qcache_inserts ) = 142,717,868 / 210841277 = 67.7%` -- Removal Ratio (frequency of needing to prune due to not enough memory)
- `( Qcache_not_cached ) = 624,955,002 / 5899529 = 105 /sec` -- SQL_CACHE attempted, but ignored -- Rethink caching; tune qcache
- `( Qcache_not_cached / (Qcache_hits + Com_select + Qcache_not_cached) ) = 624,955,002 / (608264262 + 835804146 + 624955002) = 30.2%` -- Percent of SELECTs that were not cached in the QC. -- QC is not very useful.
- `( (query_cache_size - Qcache_free_memory) / Qcache_queries_in_cache / query_alloc_block_size ) = (88158208 - 1227312) / 34276 / 8192 = 0.31` -- query_alloc_block_size vs formula -- Adjust query_alloc_block_size (now 8192)
- `( Select_scan ) = 171,787,477 / 5899529 = 29 /sec` -- full table scans -- Add indexes / optimize queries (unless they are tiny tables)
- `( Select_scan / Com_select ) = 171,787,477 / 835804146 = 20.6%` -- % of selects doing full table scan. (May be fooled by Stored Routines.) -- Add indexes / optimize queries
- `( relay_log_space_limit ) = 1,000,000,000 = 953.7MB` -- The max total size for relay logs on a Replica. (0 = unlimited) -- Let's discuss the rationale for having a limit.
- `( slow_query_log ) = OFF` -- Whether to log slow queries. (5.1.12)
- `( long_query_time ) = 10` -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2

Abnormally small:
Abnormally large:
Abnormal strings:
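The query-cache and slow-log changes recommended above could be applied roughly like this. This is a sketch for a MySQL 5.6/5.7-era server where the query cache still exists; on RDS/Aurora, globals are normally set in the DB parameter group rather than via SET GLOBAL, so treat these statements as illustrative, and note that the table names are hypothetical:

```sql
-- Illustrative only: on RDS/Aurora, make these changes in the parameter group.
SET GLOBAL query_cache_type = DEMAND;  -- cache only queries that ask for it
SET GLOBAL slow_query_log   = ON;
SET GLOBAL long_query_time  = 1;       -- log anything slower than 1 second

-- Then opt individual queries in or out of the cache:
SELECT SQL_CACHE    id, name FROM hot_lookup_table WHERE id = 42;  -- hypothetical table
SELECT SQL_NO_CACHE COUNT(*) FROM big_churny_table;                -- hypothetical table
```

With DEMAND in effect, only the SQL_CACHE queries are cached, which is what makes the QC "more focused" as described above.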