You really need to define quantitatively what level of "HA" you are looking for -- one man's "can-sleep-at-night" is another man's "this-thing-is-a-house-of-cards".
The minimum number of systems is two (2) -- an active and a standby replica (with something like Heartbeat or home-grown scripts to handle the failover).
With a MySQL cluster this means at least two SQL nodes and two data nodes (to continue serving requests in the event of a failure of any one node). (If you also need redundancy for the management server's functions, you would need two of those as well.)
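As a rough sketch, a minimal MySQL Cluster `config.ini` for that topology (two data nodes, two SQL nodes, one management node) might look like the following -- the hostnames are hypothetical placeholders:

```ini
[ndbd default]
# Two copies of every fragment, one per data node
NoOfReplicas=2

[ndb_mgmd]
HostName=mgmt1.example.com    # single management node; add a second for full redundancy

[ndbd]
HostName=data1.example.com

[ndbd]
HostName=data2.example.com

[mysqld]
HostName=sql1.example.com

[mysqld]
HostName=sql2.example.com
```

With `NoOfReplicas=2`, either data node can fail and the cluster keeps serving; losing the single management node does not stop traffic but prevents cluster reconfiguration until it is restored.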
The key part here is testing the failover in a development environment -- which means you need at least two more machines (or virtual machines). You also want to test upgrade and maintenance processes to ensure they won't trigger unintended consequences. (Ideally you should do nothing to production that hasn't been tested and proven in development.)
If you fail to test properly, an untested change may trigger an unintended failover, which means you incur the procedural (and possibly business) cost of one -- typically having to rebuild the former active server as the new standby.
This protects against hardware failures: power supply, NIC, disk, or switch (if the servers are on separate switches).
Note that this doesn't just apply to your DB servers -- you need two of *everything*: web servers, DB servers, firewalls, DNS servers...
Redundancy of one component is meaningless if you still have a bunch of single-points-of-failure in your stack.
The next level of protection is against network failures ("What if my ISP goes down?") -- this requires replicating your whole redundant environment above to a remote datacenter.
What's important here is that you diversify network connections, power, etc -- You don't want your standby datacenter across the street where it's fed by the same power and fiber as your main facility.
A company I consulted for had a requirement that any remote facility used for DR be "at least 15 degrees of longitude away" (i.e. "in the next time zone"). A common practice in the US is East Coast/West Coast pairings, or NY/Chicago, LA/Texas.
The next level above that is truly distributed resources (think Google) which requires a database system that supports replication and sharding (think MongoDB).
If implemented properly there's almost no chance of a true "outage", though service may occasionally be degraded and recovery can take a while.
I am not sure that you can get the database, but you can identify the user with one of the following methods:
- The general log
- The error log (with `log_warnings = 2`)
- An extra patch called user statistics (included by default in Percona and MariaDB)
- An audit plugin like McAfee's
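As a quick sketch of the first two options, both can be enabled at runtime without a restart (the log file path below is just an example):

```sql
-- General log: records every connection and statement.
-- Heavy on a busy server, so enable it only briefly.
SET GLOBAL general_log_file = '/var/log/mysql/general.log';
SET GLOBAL general_log = 'ON';

-- Error log: log_warnings = 2 adds aborted/failed connection
-- messages, including the user and host that attempted them.
SET GLOBAL log_warnings = 2;
```

Once you have what you need, turn the general log back off with `SET GLOBAL general_log = 'OFF';`.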
The first three are mentioned in this article on the MySQL Performance Blog.
Sadly, visibility into the host cache was only added in 5.6, via `performance_schema.host_cache`.
As a temporary workaround, you could set `max_connect_errors` to a large value instead of the default 10, and set `MAX_CONNECTIONS_PER_HOUR` and `MAX_USER_CONNECTIONS` per user to appropriate values (you should be using different users if you are not, and/or different DNS names).
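A sketch of that workaround, assuming a hypothetical per-application account `'app1'@'app1.example.com'` -- adjust names and limits to your environment:

```sql
-- Raise the threshold at which a host gets blocked after
-- repeated failed connections (default is 10).
SET GLOBAL max_connect_errors = 10000;

-- Per-account resource limits; one account per application
-- makes it much easier to attribute misbehaving connections.
GRANT USAGE ON *.* TO 'app1'@'app1.example.com'
    WITH MAX_CONNECTIONS_PER_HOUR 500
         MAX_USER_CONNECTIONS 50;

-- Unblock any hosts already locked out by max_connect_errors.
FLUSH HOSTS;
```

`FLUSH HOSTS` also clears the host cache, so run it after raising `max_connect_errors` if clients are already being refused.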
Best Answer
The advice you've seen about `max_connections` is a distraction from the issue you're seeing, because that setting has nothing to do with connection errors.
You can log the failed connections to the MySQL error log by issuing this:
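Given the `log_warnings` option referenced above and in the linked documentation, the statement would be:

```sql
-- Verbosity 2 adds aborted-connection and access-denied
-- messages (user and host included) to the error log.
SET GLOBAL log_warnings = 2;
```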
You can also make this a permanent setting in my.cnf so that it persists after a restart.
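For example, in the `[mysqld]` section of my.cnf:

```ini
[mysqld]
log_warnings = 2
```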
http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_log-warnings