MySQL – AWS Aurora running out of freeable memory and crashing

aws-aurora, crash, memory, mysql

We have an issue with our AWS Aurora MySQL instance running out of freeable memory and crashing as a result. AWS's response was to upgrade to a bigger instance, but I don't feel that will necessarily solve the problem.

We are on a db.r5.large instance, which has 15GB of RAM.

After a reboot, freeable memory drops to 5GB, which is pretty much expected, but it then gradually declines over the course of a week to 0GB, at which point the instance reboots – and sometimes it fails to do so and requires a manual reboot.

Graph of freeable memory declining to 0

CPU usage generally hovers around 15%, spiking when we do heavy late-night processing.

From How large should be mysql innodb_buffer_pool_size?

SELECT CEILING(Total_InnoDB_Bytes*1.6/POWER(1024,3)) RIBPS FROM
(SELECT SUM(data_length+index_length) Total_InnoDB_Bytes
FROM information_schema.tables WHERE engine='InnoDB') A;

gives 10GB as required to hold all data and indexes in memory.
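
For comparison, the buffer pool the instance is actually configured with can be read straight from the server (a quick sanity check; Aurora's default is roughly three quarters of instance memory, so on this instance class it should land in the same ballpark as the 10GB estimate):

-- Currently configured InnoDB buffer pool size, in GB, to compare with the estimate above
SELECT @@innodb_buffer_pool_size / POWER(1024,3) AS BufferPoolGB;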

SELECT (PagesData*PageSize)/POWER(1024,3) DataGB FROM
(SELECT variable_value PagesData
FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_pages_data') A,
(SELECT variable_value PageSize
FROM information_schema.global_status
WHERE variable_name='Innodb_page_size') B;

gives 4.6GB in the InnoDB Buffer Pool
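
A related sanity check, in the same style as the queries above (on 5.7-compatible versions these counters live in performance_schema.global_status rather than information_schema), is how often reads miss the buffer pool and go to disk:

-- Fraction of logical read requests that had to hit disk; a value close to 0
-- means the 4.6GB of cached pages is already serving nearly every read
SELECT Reads/Requests MissRatio FROM
(SELECT variable_value Reads FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_reads') A,
(SELECT variable_value Requests FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_read_requests') B;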

SHOW FULL PROCESSLIST

shows only a couple of processes and no hung threads.
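
If the engine is 5.7-compatible and performance_schema memory instrumentation is enabled (an assumption; it is not on by default everywhere), the sys schema can show where the non-buffer-pool memory is actually going, per thread:

-- Total instrumented memory allocation, then the top per-thread consumers
SELECT * FROM sys.memory_global_total;

SELECT thread_id, user, current_allocated
FROM sys.memory_by_thread_by_current_bytes
LIMIT 10;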

My question is: is there a way of ensuring freeable memory never falls below a certain point and is released back for use (e.g. it should never drop below 2GB), so that if something is memory-heavy it won't exhaust memory?

I understand there might be a performance hit for this, but before I scale up to a bigger instance (at double the cost) and potentially see the same slow decline to 0, I'd like to see how it performs. That's better than a production machine crashing at random times.
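
For what it's worth, this MySQL generation has no single setting that caps total memory or hands it back once allocated; what usually grows over time is the per-session buffers multiplied by the number of connections. A rough, pessimistic upper bound can be estimated from the current settings (an estimate only, since most of these buffers are allocated per query or per join, not held permanently):

-- Worst-case extra memory per connection on top of the buffer pool,
-- and the connection ceiling it could be multiplied by
SELECT
@@innodb_buffer_pool_size / POWER(1024,3) BufferPoolGB,
(@@read_buffer_size + @@read_rnd_buffer_size + @@sort_buffer_size +
 @@join_buffer_size + @@thread_stack + @@tmp_table_size) / POWER(1024,2) PerConnectionMB,
@@max_connections MaxConnections;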


Additional information:

I've added the various MySQL outputs to https://pastebin.com/CkRxqL04

Not really sure how to run the Unix commands on the RDS instance as there's no access to the filesystem.

I checked other logs and it seems the CPU is steadily climbing too, when it should burst and then drop back to a normal level.

CPU usage keeps increasing

This is not a high traffic site, though it does do some heavy lifting behind the scenes.

DB connections

While doing further digging, I noticed there's a replication thread still active from when the initial migration was done, attempting to sync to a slave. Even as root I can't remove it, but could this be eating memory/CPU?

2019-08-29 00:30:04 7112 [Note] Error reading relay log event: slave SQL thread was killed
2019-08-29 00:30:04 7112 [Note] Slave I/O thread killed while connecting to master
2019-08-29 00:30:04 7112 [Note] Slave I/O thread exiting, read up to log 'mysql-bin-changelog.000016', position 904410367
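
If this does turn out to be a leftover external replication configuration, RDS/Aurora expose stored procedures to clear it, since CHANGE MASTER / RESET SLAVE aren't available even to root (a sketch, assuming the standard mysql.rds_* procedures are present on this engine version):

-- Stop the replica threads, then clear the stored external master configuration
CALL mysql.rds_stop_replication;
CALL mysql.rds_reset_external_master;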

UPDATE 2019-09-09

The pattern still seems to be the same – CPU is going up and up – albeit gradually.

Slow log is empty (I know it's turned on, as there were previously entries in it).
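
To rule out the slow log simply writing somewhere unexpected, the relevant settings and the TABLE destination can be checked directly (a quick check; on RDS log_output is often TABLE rather than FILE):

-- Confirm the slow log is enabled, where it writes, and the current threshold
SELECT @@slow_query_log, @@log_output, @@long_query_time;

-- If log_output includes TABLE, recent entries land here rather than in a file
SELECT start_time, query_time, rows_examined, LEFT(sql_text, 80) sql_snippet
FROM mysql.slow_log
ORDER BY start_time DESC
LIMIT 10;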

I also installed an APM tool and there is nothing suspicious coming up there – all scripts execute in the time they should, as do the DB queries. We have some long-running cron jobs (3 minutes), but they shouldn't affect the DB as they're on the webserver box.

Memory since reboot
CPU since reboot
SHOW GLOBAL STATUS LIKE 'com_stmt_%';
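
One thing those Com_stmt_% counters can reveal is a prepared-statement leak: statements that are prepared but never closed hold server-side memory for the life of the connection, so a prepare count that keeps climbing while the close count stays flat is worth ruling out (a hypothesis to check, not a confirmed cause):

SHOW GLOBAL STATUS LIKE 'Com_stmt_prepare';
SHOW GLOBAL STATUS LIKE 'Com_stmt_close';
-- Number of statements currently held open on the server
SHOW GLOBAL STATUS LIKE 'Prepared_stmt_count';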

Best Answer

Rate Per Second = RPS. Suggestions to consider for your AWS Aurora parameter group:

log_output=TABLE,FILE  # from TABLE so you can review your Error Log FILE after crash
log_warnings=2  # from 1 to include aborted_connects and other info in your Error Log
query_cache_min_res_unit=512  # from 4096 to conserve QC RAM used
query_cache_size=64M  # from ~ 443M to reduce CPU cycles used for QC management
innodb_lru_scan_depth=100  # from 1024 to conserve 90% CPU cycles used for function
thread_cache_size=24  # from 11 to reduce threads_created count
read_rnd_buffer_size=256K  # from 512K to reduce Handler_read_rnd_next RPS of 2,558
read_buffer_size=512K  # from 256K to reduce Handler_read_next RPS of 29,608
tmp_table_size=64M  # from 16M to expand capacity
max_heap_table_size=64M  # from 16M to reduce created_tmp_disk_tables RPHr of 1,134

if ALLOWED,

innodb_io_capacity=1900  # from 200 to enable higher IOPS to SSD devices
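
To see whether the changes have the intended effect, the counters these suggestions are based on can be re-sampled after a day or so and divided by Uptime to get the same per-second / per-hour rates (a simple before-and-after check):

-- Raw counters behind the suggestions above; divide by Uptime for rates
SHOW GLOBAL STATUS WHERE Variable_name IN
('Uptime', 'Threads_created', 'Created_tmp_disk_tables',
 'Handler_read_rnd_next', 'Handler_read_next');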

OBSERVATIONS: an innodb_buffer_pool_size adjustment should only be considered when Innodb_buffer_pool_reads exceeds 50 RPS (you are at less than 1 RPS today). innodb_change_buffering=none should be reconsidered; review the reference manual for details.
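
That 50 RPS threshold can be tested directly against the running instance (same information_schema pattern as in the question; on 5.7-compatible versions use performance_schema.global_status):

-- Innodb_buffer_pool_reads divided by server uptime = disk reads per second
SELECT Reads/Up BufferPoolReadsPerSecond FROM
(SELECT variable_value Reads FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_reads') A,
(SELECT variable_value Up FROM information_schema.global_status
WHERE variable_name='Uptime') B;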

These Suggestions may help stabilize your instance.

Disclaimer: I am the content author of the website mentioned in my Network profile, where we offer free downloadable utility scripts to assist with improving performance, along with more suggestions and contact info.