Aurora MySQL 5.7 – Troubleshooting Random Failures

amazon-rds aurora

This is the fifth time this has happened. It occurs about once a week (Tuesday or Wednesday, between 03:00 and 07:00 UTC+0). The console shows the instance as available, but it is inaccessible. We waited to see whether the instance would recover on its own, but after ~30 minutes nothing had changed. I then rebooted it manually, and it came back online about 5 minutes later.

It would be helpful to know what actually went wrong. This is only a dev server with few users and records.

Engine: Aurora MySQL 5.7.12
DB instance class: db.t2.small
Backup time: 16:00-16:30 UTC+0
Maintenance time: sun:17:00-sun:17:30 UTC+0

Below is the complete list of logs available after rebooting the instance.

error/mysql-error-running.log.2018-07-24.03 Tue Jul 24 11:14:06 GMT+800 2018    11.8 kB
error/mysql-error-running.log.2018-07-24.04 Tue Jul 24 11:30:00 GMT+800 2018    285.5 kB
error/mysql-error-running.log.2018-07-24.05 Tue Jul 24 12:30:00 GMT+800 2018    31.1 kB
error/mysql-error-running.log.2018-07-24.06 Tue Jul 24 13:30:00 GMT+800 2018    31.8 kB
error/mysql-error-running.log.2018-07-24.07 Tue Jul 24 14:30:00 GMT+800 2018    32.9 kB
error/mysql-error-running.log.2018-07-24.08 Tue Jul 24 15:30:00 GMT+800 2018    29 kB
error/mysql-error-running.log.2018-07-24.09 Tue Jul 24 16:30:00 GMT+800 2018    32.1 kB
error/mysql-error-running.log.2018-07-24.10 Tue Jul 24 17:30:00 GMT+800 2018    27.5 kB
error/mysql-error-running.log.2018-07-24.11 Tue Jul 24 18:30:00 GMT+800 2018    31.7 kB
error/mysql-error-running.log.2018-07-24.12 Tue Jul 24 19:30:00 GMT+800 2018    27.1 kB
error/mysql-error-running.log.2018-07-24.13 Tue Jul 24 20:30:00 GMT+800 2018    22.4 kB
error/mysql-error-running.log.2018-07-24.14 Tue Jul 24 21:30:00 GMT+800 2018    22.8 kB
error/mysql-error-running.log.2018-07-24.15 Tue Jul 24 22:30:00 GMT+800 2018    24.7 kB
error/mysql-error-running.log.2018-07-24.16 Tue Jul 24 23:30:00 GMT+800 2018    24.7 kB
error/mysql-error.log   Wed Jul 25 00:34:45 GMT+800 2018    2.6 kB
external/mysql-external.log Wed Jul 25 00:30:00 GMT+800 2018    7.6 kB
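One quick way to decide which hourly log to read first is to compare file sizes. The sketch below is only an illustration: it copies the hourly error-log lines from the listing above and flags any file more than three times the median size. The 3x threshold and the helper name `parse_size_kb` are assumptions chosen for this example, not anything provided by RDS.

```python
from statistics import median

# Hourly error-log lines copied from the listing above.
LISTING = """\
error/mysql-error-running.log.2018-07-24.03 Tue Jul 24 11:14:06 GMT+800 2018    11.8 kB
error/mysql-error-running.log.2018-07-24.04 Tue Jul 24 11:30:00 GMT+800 2018    285.5 kB
error/mysql-error-running.log.2018-07-24.05 Tue Jul 24 12:30:00 GMT+800 2018    31.1 kB
error/mysql-error-running.log.2018-07-24.06 Tue Jul 24 13:30:00 GMT+800 2018    31.8 kB
error/mysql-error-running.log.2018-07-24.07 Tue Jul 24 14:30:00 GMT+800 2018    32.9 kB
error/mysql-error-running.log.2018-07-24.08 Tue Jul 24 15:30:00 GMT+800 2018    29 kB
error/mysql-error-running.log.2018-07-24.09 Tue Jul 24 16:30:00 GMT+800 2018    32.1 kB
error/mysql-error-running.log.2018-07-24.10 Tue Jul 24 17:30:00 GMT+800 2018    27.5 kB
error/mysql-error-running.log.2018-07-24.11 Tue Jul 24 18:30:00 GMT+800 2018    31.7 kB
error/mysql-error-running.log.2018-07-24.12 Tue Jul 24 19:30:00 GMT+800 2018    27.1 kB
error/mysql-error-running.log.2018-07-24.13 Tue Jul 24 20:30:00 GMT+800 2018    22.4 kB
error/mysql-error-running.log.2018-07-24.14 Tue Jul 24 21:30:00 GMT+800 2018    22.8 kB
error/mysql-error-running.log.2018-07-24.15 Tue Jul 24 22:30:00 GMT+800 2018    24.7 kB
error/mysql-error-running.log.2018-07-24.16 Tue Jul 24 23:30:00 GMT+800 2018    24.7 kB
"""


def parse_size_kb(line):
    """Return (log name, size in kB) for one listing line."""
    parts = line.split()
    # First field is the log name; the last two fields are "<size> kB".
    return parts[0], float(parts[-2])


sizes = dict(parse_size_kb(line) for line in LISTING.splitlines())
typical = median(sizes.values())

# Assumed threshold: anything above 3x the median is worth reading first.
outliers = [name for name, kb in sizes.items() if kb > 3 * typical]

for name in outliers:
    print(f"{name}: {sizes[name]} kB (median {typical} kB)")
```

With the listing above, the only file flagged is the `.04` log, which is roughly ten times larger than the others.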

external/mysql-external.log

/rdsdbbin/oscar/bin/mysqld, Version: 5.7.12 (MySQL Community Server (GPL)). started with:
Tcp port: 3306 Unix socket: /tmp/mysql.sock
Time,ServerHost,User,UserHost,Command,Payload
/rdsdbbin/oscar/bin/mysqld, Version: 5.7.12 (MySQL Community Server (GPL)). started with:
Tcp port: 3306 Unix socket: /tmp/mysql.sock
Time,ServerHost,User,UserHost,Command,Payload
/rdsdbbin/oscar/bin/mysqld, Version: 5.7.12 (MySQL Community Server (GPL)). started with:
Tcp port: 3306 Unix socket: /tmp/mysql.sock
Time,ServerHost,User,UserHost,Command,Payload
----------------------- END OF LOG ----------------------

error/mysql-error-running.log.2018-07-24.03 shows: https://pastebin.com/ywmXLR5g.

error/mysql-error-running.log.2018-07-24.04 shows: https://pastebin.com/g1dkR6rj.

error/mysql-error-running.log.2018-07-24.18 shows: https://pastebin.com/g0aAXfaT.

All other logs show nothing (see photo).

[Figure: screenshot of the remaining logs]

Event Logs

July 24, 2018 at 11:14:14 AM UTC+8  DB instance restarted
July 24, 2018 at 11:13:31 AM UTC+8  Error restarting mysql: Engine bootstrap failed with no mysqld process running...
July 24, 2018 at 11:12:01 AM UTC+8  Recovery of the DB instance is complete.
July 24, 2018 at 11:04:26 AM UTC+8  Recovery of the DB instance has started. Recovery time will vary with the amount of data to be recovered.
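Because the console events are shown in UTC+8 while the backup and maintenance windows are defined in UTC+0, it helps to put everything on one clock. The following is a minimal sketch, using only the values quoted in this post, that converts the event timestamps to UTC, measures the time from the start of recovery to the restart, and checks whether any event falls inside the backup or maintenance window. The function names are made up for this example.

```python
from datetime import datetime, time, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))

# Events from the console, oldest first (timestamps are UTC+8).
EVENTS = [
    ("July 24, 2018 at 11:04:26 AM", "Recovery of the DB instance has started"),
    ("July 24, 2018 at 11:12:01 AM", "Recovery of the DB instance is complete"),
    ("July 24, 2018 at 11:13:31 AM", "Error restarting mysql"),
    ("July 24, 2018 at 11:14:14 AM", "DB instance restarted"),
]


def to_utc(stamp):
    """Parse a console timestamp (UTC+8) and convert it to UTC."""
    local = datetime.strptime(stamp, "%B %d, %Y at %I:%M:%S %p")
    return local.replace(tzinfo=UTC8).astimezone(timezone.utc)


def in_backup_window(t):
    """Backup window from the question: 16:00-16:30 UTC, every day."""
    return time(16, 0) <= t.time() < time(16, 30)


def in_maintenance_window(t):
    """Maintenance window from the question: Sunday 17:00-17:30 UTC."""
    return t.weekday() == 6 and time(17, 0) <= t.time() < time(17, 30)


for stamp, message in EVENTS:
    t = to_utc(stamp)
    flags = []
    if in_backup_window(t):
        flags.append("backup window")
    if in_maintenance_window(t):
        flags.append("maintenance window")
    print(t.strftime("%a %H:%M:%S UTC"), message, flags)

# Time from the start of recovery to the restart event.
outage = to_utc(EVENTS[-1][0]) - to_utc(EVENTS[0][0])
print("recovery start to restart:", outage)
```

All four events land shortly after 03:00 UTC on a Tuesday, inside the 03:00-07:00 range described in the question and outside both scheduled windows.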

[Figure: CPU Utilization (07-24-2018)]

[Figure: CPU Utilization (07-11-2018 to 07-24-2018)]

Best Answer

Special thanks to @WilsonHauck. After 4 weeks of monitoring, I can confirm that manually upgrading Aurora to the latest version solved the issue.

Several bug fixes addressing unexpected restarts have been released as of 2.01.1: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraMySQL.Updates.20Updates.html

To manually upgrade your Aurora:

  1. Go to RDS in the AWS Console
  2. Navigate to Clusters
  3. Select your cluster
  4. Click Actions >> Upgrade now
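The console steps above can probably be scripted as well. The sketch below assumes that "Upgrade now" corresponds to applying the cluster's pending maintenance actions immediately, which is how I understand the RDS API but did not verify against this particular cluster. `upgrade_now` is a hypothetical helper name; it takes an RDS client (for example one created with the third-party boto3 package) and the cluster ARN, which you must supply yourself.

```python
def upgrade_now(rds, cluster_arn):
    """Apply every pending maintenance action on the cluster right away.

    `rds` is expected to behave like a boto3 RDS client. Returns the list
    of action names that were submitted (empty if nothing was pending).
    """
    response = rds.describe_pending_maintenance_actions(
        ResourceIdentifier=cluster_arn
    )
    applied = []
    for resource in response.get("PendingMaintenanceActions", []):
        for detail in resource.get("PendingMaintenanceActionDetails", []):
            rds.apply_pending_maintenance_action(
                ResourceIdentifier=resource["ResourceIdentifier"],
                ApplyAction=detail["Action"],
                # "immediate" applies the action now instead of waiting
                # for the next maintenance window.
                OptInType="immediate",
            )
            applied.append(detail["Action"])
    return applied


# Usage (needs boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   print(upgrade_now(boto3.client("rds"), "<your cluster ARN>"))
```

Passing the client in as an argument keeps the helper easy to try against a stub before pointing it at a real cluster. As with the console action, expect the instance to restart while the update is applied.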