Mysql – Unexpected failover of RDS-MySQL server

amazon-rdsfailoverMySQL

We have a 16GB RDS-MySQL Multi-AZ instance. Of late, it has unexpectedly failed over multiple times causing the dependent application to be offline for over 15 minutes.

AWS-Support blamed it on high Swap Usage. However, the instance has failed when the Swap Usage was as low as 10 MB. We have started optimizing the queries.

Is there anything else that we can do to avoid such mishap?

===================================================================

EDIT :

We have made one observation. The CPU utilization of the instance remains fairly low. So, we presumed that the issue might be with disk access. The write-throughput is several times that of read-throughput. To add to it, Sort_merge_passes has a value of 79425(which is crazy, right ?). On going through the slow query log, it was found that there are multiple queries with ORDER BY clause not using indexes. It will take time before workarounds are found for all these queries. Can you suggest some immediate fix to overcome the failover issue ? Should we increase the global value of sort_buffer_size ? Current value is 2MB.

===========================================================

EDIT 2 :

Below is a screenshot of the rate of change of Sort_Merge_Passes(Queries per second) as seen on RDS-Performance Insights.

enter image description here

As can be seen, during periods of load,the value varies from 0.5 to 1(we have observed values higher than 1 as well). A query can cause multiple Sort_merge_passes, 8 being the highest that we have observed.

=======================================================================

EDIT 3 :

We also found that the global status Created_tmp_disk_tables is increasing at the rate of approx 2-3/second. Badly structured queries I guess.

Best Answer

  • Swapping is a bad symptom of over-use of RAM.
  • An excessive number of simultaneous connections may lead to swapping.
  • Very complex queries may lead to swapping (due to temp tables). Let's see such queries if you have any.
  • Have you changed any things in my.cnf (or the equivalent in RDS)? If so let's see if you raised some values too much.
  • As a quick fix (and if RDS lets you), decrase innodb_buffer_pool_size by a gigabyte.

  • Sort_merge_passes - divide by Uptime. 1/second would be "high", but not high enough to be concerned with.

  • For deeper analysis, see http://mysql.rjweb.org/doc.php/mysql_analysis -- the GLOBAL STATUS and VARIABLES may find some clues; pt-query-digest will focus on the 'worst' queries.
  • sort_buffer_size = 2M is fine.
  • "ORDER BY clause not using indexes" - show us the queries, plus SHOW CREATE TABLE.