AWS RDS MySQL – Diagnosing Failures

amazon-rds, aws, mysql

I have a simple RDS setup: one database, no replication. It has been running for around 2 years without any problems. CPU usage is generally less than 5%, sometimes boosting to around 10%.

Today, without any apparent reason or warning, my application lost connection with the DB. Looking at the log files, I could see the message "Recovery of the DB instance started…" and a few minutes later "Recovery of the DB instance complete". At that point my application was able to reconnect and work fine.

How do I go about diagnosing this further? The log file has about 30 lines in it, starting with "Giving 2 client threads a chance to die gracefully", then "Shutting down slave threads". After that, the service goes through a restart procedure.

Is it normal operation for RDS to 'recover the instance' after a failure like this? Presumably I could lose a few minutes of data?

Update:

The logs are no longer available, so I cannot post them. Also, I notice that the freeable memory jumps up sharply from the time of the incident, which would seem to be a good thing.

Best Answer

I have had this happen, even to production RDS instances, though it is rare.

How to go about diagnosing

Check there was no advance notice of maintenance work: Did you get any email notifications from AWS to say that your instance needed mandatory system upgrades or maintenance? Did the outage fall within your instance's permitted maintenance window?
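If the emails are hard to track down, the same information can usually be pulled from the RDS API. The following is a minimal sketch, assuming `rds_client` is a boto3 RDS client (`boto3.client("rds")`); the function names and the instance identifier are hypothetical, and pagination is ignored. Note that pending maintenance only lists actions that have not yet been applied, so maintenance that already ran would show up in the event list instead.

```python
def recent_instance_events(rds_client, instance_id, hours=24):
    """Return (timestamp, message) pairs for one DB instance, oldest first."""
    response = rds_client.describe_events(
        SourceIdentifier=instance_id,
        SourceType="db-instance",
        Duration=hours * 60,  # the API expects minutes
    )
    events = [(e["Date"], e["Message"]) for e in response.get("Events", [])]
    return sorted(events)


def pending_maintenance(rds_client, instance_id):
    """Return (action, description) pairs still queued for one DB instance."""
    response = rds_client.describe_pending_maintenance_actions()
    actions = []
    for resource in response.get("PendingMaintenanceActions", []):
        # ResourceIdentifier is an ARN; DB instance ARNs end in ":db:<instance-id>".
        if resource.get("ResourceIdentifier", "").endswith(":db:" + instance_id):
            for detail in resource.get("PendingMaintenanceActionDetails", []):
                actions.append((detail.get("Action"), detail.get("Description")))
    return actions
```

With boto3 installed and credentials configured, calling `recent_instance_events(boto3.client("rds"), "my-db-instance")` (where `my-db-instance` is a placeholder) should list messages such as "Recovery of the DB instance started" alongside any maintenance or failover events from the same period.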

Raise a ticket with AWS Support: If you have AWS support, raise a ticket asking them what happened. On the occasions that this has happened to me and I have raised a ticket, they were not able to give me a good reason for the DB going away, but have generally shrugged and blamed local networking issues on the instance's hypervisor.

Did I lose data? It is unlikely that you lost data if you are using InnoDB and the shutdown was issued by AWS. The log lines that you cite, "Giving 2 client threads a chance to die gracefully" and then "Shutting down slave threads", come from a command-initiated shutdown rather than a crash. It looks like AWS issued a shutdown for you.
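If you still have the error log (or catch it next time before it rotates away), a quick scan can tell the two cases apart. This is only a sketch: the clean-shutdown patterns are the lines quoted in the question, while the crash patterns are an assumption based on the messages InnoDB typically prints when it starts up after an unclean stop, and their exact wording varies between MySQL versions. The function name `classify_restart` is made up for illustration.

```python
import re

# Lines quoted in the question; mysqld prints them during an orderly shutdown.
CLEAN_SHUTDOWN_MARKERS = (
    re.compile(r"Giving \d+ client threads a chance to die gracefully"),
    re.compile(r"Shutting down slave threads"),
)

# Assumed crash indicators (InnoDB startup messages after an unclean stop).
# Wording differs across versions, so treat these as a starting point.
CRASH_MARKERS = (
    re.compile(r"Database was not shut ?down normally", re.IGNORECASE),
    re.compile(r"Starting crash recovery", re.IGNORECASE),
)


def classify_restart(log_text):
    """Return 'crash', 'clean shutdown' or 'unknown' for a MySQL error log."""
    # Crash markers win: recovery on startup means the previous stop was unclean.
    if any(p.search(log_text) for p in CRASH_MARKERS):
        return "crash"
    if any(p.search(log_text) for p in CLEAN_SHUTDOWN_MARKERS):
        return "clean shutdown"
    return "unknown"
```

A result of "unknown" just means none of the patterns matched; it is not evidence either way.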

Other notes: Occasionally AWS moves your instance to another hypervisor without prior notice, for example because the host it is on has developed hardware problems and they want to clear it. There might also be networking issues within AWS, or you might be subject to 'noisy neighbours'. There are many possible reasons and, unfortunately, AWS is not contractually obliged to tell you why.