MySQL – AWS Aurora running out of freeable memory and crashing

aws-aurora, crash, memory, mysql

We have an issue with our AWS Aurora MySQL instance running out of freeable memory and crashing as a result. AWS's response was to upgrade to a bigger instance, but I don't feel that will necessarily solve the problem.

We are on a db.r5.large instance, which has 15GB of RAM.

After a reboot, freeable memory drops to 5GB, which is pretty much expected, but it then gradually declines over the course of a week to 0GB, at which point the instance reboots – and sometimes it fails to do so and requires a manual reboot.

Graph of freeable memory declining to 0

CPU usage generally hovers around 15%, spiking when we do heavy late-night processing.

From How large should be mysql innodb_buffer_pool_size?

SELECT CEILING(Total_InnoDB_Bytes*1.6/POWER(1024,3)) RIBPS FROM
(SELECT SUM(data_length+index_length) Total_InnoDB_Bytes
FROM information_schema.tables WHERE engine='InnoDB') A;

gives 10GB as required to hold all data and indexes in memory.
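
For comparison, the buffer pool the instance is actually configured with can be read straight from the server (a quick sanity check; Aurora's default is roughly three quarters of instance memory, so on this instance class it should land in the same ballpark as the 10GB estimate):

-- Currently configured InnoDB buffer pool size, in GB, to compare with the estimate above
SELECT @@innodb_buffer_pool_size / POWER(1024,3) AS BufferPoolGB;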

SELECT (PagesData*PageSize)/POWER(1024,3) DataGB FROM
(SELECT variable_value PagesData
FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_pages_data') A,
(SELECT variable_value PageSize
FROM information_schema.global_status
WHERE variable_name='Innodb_page_size') B;

gives 4.6GB in the InnoDB Buffer Pool
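
A related sanity check, in the same style as the queries above (on 5.7-compatible versions these counters live in performance_schema.global_status rather than information_schema), is how often reads miss the buffer pool and go to disk:

-- Fraction of logical read requests that had to hit disk; a value close to 0
-- means the 4.6GB of cached pages is already serving nearly every read
SELECT Reads/Requests MissRatio FROM
(SELECT variable_value Reads FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_reads') A,
(SELECT variable_value Requests FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_read_requests') B;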

SHOW FULL PROCESSLIST

shows only a couple of processes and no hung threads.
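
If the engine is 5.7-compatible and performance_schema memory instrumentation is enabled (an assumption; it is not on by default everywhere), the sys schema can show where the non-buffer-pool memory is actually going, per thread:

-- Total instrumented memory allocation, then the top per-thread consumers
SELECT * FROM sys.memory_global_total;

SELECT thread_id, user, current_allocated
FROM sys.memory_by_thread_by_current_bytes
LIMIT 10;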

My question is: is there a way of ensuring freeable memory never falls below a certain point and is released back for use (e.g. it should never drop below 2GB), so that if something is memory-heavy it won't exhaust memory?

I understand there might be a performance hit for this, but before I scale up to a bigger instance (at double the cost) and potentially see the same slow decline to 0, I'd like to see how it performs. That's better than a production machine crashing at random times.
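
For what it's worth, this MySQL generation has no single setting that caps total memory or hands it back once allocated; what usually grows over time is the per-session buffers multiplied by the number of connections. A rough, pessimistic upper bound can be estimated from the current settings (an estimate only, since most of these buffers are allocated per query or per join, not held permanently):

-- Worst-case extra memory per connection on top of the buffer pool,
-- and the connection ceiling it could be multiplied by
SELECT
@@innodb_buffer_pool_size / POWER(1024,3) BufferPoolGB,
(@@read_buffer_size + @@read_rnd_buffer_size + @@sort_buffer_size +
 @@join_buffer_size + @@thread_stack + @@tmp_table_size) / POWER(1024,2) PerConnectionMB,
@@max_connections MaxConnections;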


Additional information:

I've added the various MySQL outputs to https://pastebin.com/CkRxqL04

Not really sure how to run the Unix commands on the RDS instance as there's no access to the filesystem.

I checked other logs and it seems the CPU is steadily climbing too, when it should burst and then drop back to a normal level.

CPU usage keeps increasing

This is not a high traffic site, though it does do some heavy lifting behind the scenes.

DB connections

While doing further digging, I noticed there's a replication thread still active from when the initial migration was done, attempting to sync to a slave. Even as root I can't remove it, but could this be eating memory/CPU?

2019-08-29 00:30:04 7112 [Note] Error reading relay log event: slave SQL thread was killed
2019-08-29 00:30:04 7112 [Note] Slave I/O thread killed while connecting to master
2019-08-29 00:30:04 7112 [Note] Slave I/O thread exiting, read up to log 'mysql-bin-changelog.000016', position 904410367
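
If this does turn out to be a leftover external replication configuration, RDS/Aurora expose stored procedures to clear it, since CHANGE MASTER / RESET SLAVE aren't available even to root (a sketch, assuming the standard mysql.rds_* procedures are present on this engine version):

-- Stop the replica threads, then clear the stored external master configuration
CALL mysql.rds_stop_replication;
CALL mysql.rds_reset_external_master;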

UPDATE 2019-09-09

The pattern still seems to be the same – CPU is going up and up – albeit gradually.

Slow log is empty (I know it's turned on, as there were previously entries in it).
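
To rule out the slow log simply writing somewhere unexpected, the relevant settings and the TABLE destination can be checked directly (a quick check; on RDS log_output is often TABLE rather than FILE):

-- Confirm the slow log is enabled, where it writes, and the current threshold
SELECT @@slow_query_log, @@log_output, @@long_query_time;

-- If log_output includes TABLE, recent entries land here rather than in a file
SELECT start_time, query_time, rows_examined, LEFT(sql_text, 80) sql_snippet
FROM mysql.slow_log
ORDER BY start_time DESC
LIMIT 10;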

I also installed an APM tool and there is nothing suspicious coming up there – all scripts execute in the time they should, as do the DB queries. We have some long-running cron jobs (3 minutes), but they shouldn't affect the DB as they're on the webserver box.

Memory since reboot
CPU since reboot
SHOW GLOBAL STATUS LIKE 'com_stmt_%';
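
One thing those Com_stmt_% counters can reveal is a prepared-statement leak: statements that are prepared but never closed hold server-side memory for the life of the connection, so a prepare count that keeps climbing while the close count stays flat is worth ruling out (a hypothesis to check, not a confirmed cause):

SHOW GLOBAL STATUS LIKE 'Com_stmt_prepare';
SHOW GLOBAL STATUS LIKE 'Com_stmt_close';
-- Number of statements currently held open on the server
SHOW GLOBAL STATUS LIKE 'Prepared_stmt_count';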

Best Answer

Rate Per Second = RPS. Suggestions to consider for your AWS Aurora parameter group:

log_output=TABLE,FILE  # from TABLE so you can review your Error Log FILE after crash
log_warnings=2  # from 1 to include aborted_connects and other info in your Error Log
query_cache_min_res_unit=512  # from 4096 to conserve QC RAM used
query_cache_size=64M  # from ~ 443M to reduce CPU cycles used for QC management
innodb_lru_scan_depth=100  # from 1024 to conserve 90% CPU cycles used for function
thread_cache_size=24  # from 11 to reduce threads_created count
read_rnd_buffer_size=256K  # from 512K to reduce Handler_read_rnd_next RPS of 2,558
read_buffer_size=512K  # from 256K to reduce Handler_read_next RPS of 29,608
tmp_table_size=64M  # from 16M to expand capacity
max_heap_table_size=64M  # from 16M to reduce created_tmp_disk_tables RPHr of 1,134

if ALLOWED,

innodb_io_capacity=1900  # from 200 to enable higher IOPS to SSD devices
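
To see whether the changes have the intended effect, the counters these suggestions are based on can be re-sampled after a day or so and divided by Uptime to get the same per-second / per-hour rates (a simple before-and-after check):

-- Raw counters behind the suggestions above; divide by Uptime for rates
SHOW GLOBAL STATUS WHERE Variable_name IN
('Uptime', 'Threads_created', 'Created_tmp_disk_tables',
 'Handler_read_rnd_next', 'Handler_read_next');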

OBSERVATIONS: an innodb_buffer_pool_size adjustment should only be considered when Innodb_buffer_pool_reads exceeds 50 RPS (you are at less than 1 RPS today). innodb_change_buffering=none should be reconsidered; review the reference manual for details.
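
That 50 RPS threshold can be tested directly against the running instance (same information_schema pattern as in the question; on 5.7-compatible versions use performance_schema.global_status):

-- Innodb_buffer_pool_reads divided by server uptime = disk reads per second
SELECT Reads/Up BufferPoolReadsPerSecond FROM
(SELECT variable_value Reads FROM information_schema.global_status
WHERE variable_name='Innodb_buffer_pool_reads') A,
(SELECT variable_value Up FROM information_schema.global_status
WHERE variable_name='Uptime') B;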

These Suggestions may help stabilize your instance.

Disclaimer: I am the content author of the website mentioned in my Network profile, where we offer free downloadable utility scripts to assist with improving performance, along with more suggestions and contact info.