Monitor everything that you can, as frequently as you can. I highly recommend Graphite with statsd as a central place to collect all your metrics. It provides a very simple plaintext protocol that makes it trivial to log nearly any metric, and a UI that makes it incredibly easy to compare one metric against another. On my systems I collect a ton of information, and most of it has proved invaluable at one point or another.
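For reference, Carbon (Graphite's ingest daemon) really does just accept plaintext lines of the form "metric.path value timestamp" over TCP, port 2003 by default. A minimal Python sketch; the hostname and metric name are placeholders:

```python
import socket
import time

def send_to_graphite(path, value, host="graphite.example.com", port=2003):
    """Send one metric using Carbon's plaintext protocol:
    '<metric.path> <value> <unix_timestamp>\n' over TCP (port 2003 by default)."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

send_to_graphite("db.mysql01.threads_running", 12)
```

Here are a few of the things I collect: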
I wrote a daemon called mysampler that sends the output of SHOW GLOBAL STATUS to Graphite (or CSV, if you want) at regular intervals. We log this at 5s intervals, but there are times I wish we had it set to 1s intervals; you start to see some very interesting patterns at that level of granularity. It's aware of which stats are counters and which are absolute values (Questions is a counter, Threads_running is an absolute value) and will output the deltas for the counters.
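I don't have the mysampler source in front of me, but the core loop is easy to sketch. This assumes the pymysql driver, reuses the send_to_graphite helper from the snippet above, and hard-codes a few gauge (absolute-value) variables; everything else is treated as a counter and sent as a delta:

```python
import time
import pymysql  # assumed driver; any DB-API connector would do

GAUGES = {"Threads_running", "Threads_connected", "Open_tables"}  # absolute values
INTERVAL = 5  # seconds; 1s shows far more interesting patterns

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret", autocommit=True)
prev = {}

while True:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS")
        status = dict(cur.fetchall())
    for name, raw in status.items():
        try:
            value = float(raw)
        except ValueError:
            continue  # skip non-numeric status variables
        # Gauges go through as-is; counters are sent as the delta since the last sample.
        out = value if name in GAUGES else value - prev.get(name, value)
        send_to_graphite("db.mysql01.status.%s" % name.lower(), out)
        prev[name] = value
    time.sleep(INTERVAL)
```

A real daemon would also need to handle counter resets (server restarts, FLUSH STATUS) so a negative delta doesn't end up on the graph.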
ab-tblstats2g runs from cron every night and sends table size statistics to Graphite so we can track table growth. I plan to extend it to include the maximum primary key value and the number of rows (from table statistics) in the near future. It also works with Microsoft SQL Server.
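Something similar is easy to roll yourself against information_schema (the sizes and row counts are estimates for InnoDB, but that's fine for tracking growth). The metric prefix and connection details here are made up, and this again leans on the send_to_graphite helper above:

```python
import pymysql  # assumed driver

SIZE_SQL = """
    SELECT table_schema, table_name, data_length, index_length, table_rows
    FROM information_schema.TABLES
    WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
"""

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
with conn.cursor() as cur:
    cur.execute(SIZE_SQL)
    for schema, table, data_len, index_len, row_est in cur.fetchall():
        prefix = "db.mysql01.tables.%s.%s" % (schema, table)
        send_to_graphite(prefix + ".data_bytes", data_len or 0)
        send_to_graphite(prefix + ".index_bytes", index_len or 0)
        send_to_graphite(prefix + ".rows_estimate", row_est or 0)
```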
mysql_logger logs the output of SHOW FULL PROCESSLIST to syslog at a fixed interval. It makes it trivial to find out exactly what was running concurrently when something goes weird (table locks, long-running queries, etc.). We dump that data into Splunk for easy searching, but I still sometimes just grep the syslog files.
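Rolling your own version is only a few lines; this sketch logs one line per connection every 30 seconds (the interval, facility, and formatting are guesses at what you'd want, not a copy of mysql_logger):

```python
import syslog
import time
import pymysql  # assumed driver

syslog.openlog("mysql_processlist", 0, syslog.LOG_LOCAL0)
conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret", autocommit=True)

while True:
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        for row in cur.fetchall():
            # Columns: Id, User, Host, db, Command, Time, State, Info
            syslog.syslog(syslog.LOG_INFO, " | ".join(str(col) for col in row))
    time.sleep(30)
```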
pt-stalk from the Percona Toolkit is great for "what just happened?" scenarios. It watches for a server status variable to exceed a threshold (Threads_connected > 25 by default, but Threads_running is usually a more valuable metric, in my experience) and, when triggered, collects a bunch of data about MySQL and the system, which you can review with pt-sift or by reading the generated files directly. It will even capture tcpdump, gdb, oprofile, and strace output.
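pt-stalk does far more than this, but the trigger loop at its heart is simple enough to illustrate. This is only a sketch of the idea, not pt-stalk's implementation; the threshold, back-off, and diagnostic command are placeholders (mysqladmin here assumes credentials in ~/.my.cnf):

```python
import subprocess
import time
import pymysql  # assumed driver

VARIABLE = "Threads_running"  # the more useful trigger, per the text above
THRESHOLD = 25

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret", autocommit=True)

while True:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE %s", (VARIABLE,))
        _, value = cur.fetchone()
    if int(value) > THRESHOLD:
        # Triggered: capture whatever diagnostics you care about to a timestamped file.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        with open("/tmp/stalk-%s.txt" % stamp, "w") as out:
            subprocess.run(["mysqladmin", "processlist", "extended-status"], stdout=out)
        time.sleep(300)  # back off so one incident doesn't fill the disk
    time.sleep(1)
```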
That's basically what we monitor, which differs from alerting. For alerting, I suggest you alert on a very small number of metrics. You can cover 90% of cases by simply choosing a workload-representative query and setting a threshold on how long it takes to return. If it exceeds that threshold, alert: there is a problem. Otherwise, you're OK. No need to check "is the process running" or anything like that. Other things to watch for are entries in the MySQL error log, getting close to max_connections, and how well replication is functioning (slave lag, slave running, tables in sync). Hit ratios are completely useless for alerting purposes; all that matters is that queries are returning within some period of time.
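As a concrete example, that kind of check can be as small as the sketch below; the query, threshold, and exit codes (Nagios-style) are all things you'd choose for your own workload, and the table in the query is hypothetical:

```python
import sys
import time
import pymysql  # assumed driver

QUERY = "SELECT id FROM orders WHERE customer_id = 42"  # a workload-representative query (placeholder)
THRESHOLD = 2.0  # seconds

conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
start = time.time()
with conn.cursor() as cur:
    cur.execute(QUERY)
    cur.fetchall()
elapsed = time.time() - start

if elapsed > THRESHOLD:
    print("CRITICAL: representative query took %.2fs" % elapsed)
    sys.exit(2)  # critical
print("OK: representative query took %.2fs" % elapsed)
sys.exit(0)
```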
For further reading, the Percona white paper Preventing MySQL Emergencies goes into great detail on what to monitor and alert on. Percona has also released a set of Nagios plugins (which should work with Zabbix, I believe) that you can use.
I managed to solve this; these are the steps I followed:
First, I contacted the Amazon RDS team by posting on their discussion forum. They confirmed it was the mysqld process taking up all this CPU, which ruled out a problem with something else running on the physical server.
Second, I tracked down the source of the queries that were running:
SELECT `mytable`.* FROM `mytable` WHERE `mytable`.`foreign_key` = 231273 LIMIT 1
I originally overlooked this as the cause, because none of these queries seemed to be taking particularly long when I monitored the SHOW PROCESSLIST output. After exhausting other avenues, I decided it might be worth following up... and I'm glad I did.
As you can see in the SHOW PROCESSLIST output, these queries were coming from a utility server, which runs some tactical utility jobs that exist outside of our main application code. This is why they were not showing up as slow or causing issues in our New Relic monitoring: the New Relic agent is only installed on our main app server.
Loosely following this guide:
http://www.mysqlperformanceblog.com/2007/02/08/debugging-sleeping-connections-with-mysql/
I was able to trace these queries to a specific running process on our utility server box. It was a bit of Ruby code that was very inefficiently iterating through around 70,000 records, checking some field values and using those to decide whether it needed to create a new record in 'mytable'. After doing some analysis I determined the process was no longer needed, so it could be killed.
To make matters worse, there seemed to be six instances of this same process running at once, due to the way the cron job was configured and how long each run took! I killed off these processes, and, incredibly, our CPU usage fell from around 100% to around 5%!
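For what it's worth, that kind of pile-up is easy to guard against with a simple lock so a cron job refuses to start while the previous run is still going. This is just one way to do it (a sketch, not what we actually deployed; the lock file path is arbitrary):

```python
import fcntl
import sys

# Take an exclusive, non-blocking lock; if a previous run still holds it,
# bail out instead of piling up another copy of the job.
lock_file = open("/var/run/mytable_sync.lock", "w")
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit("previous run still in progress; exiting")

# ... do the actual work here ...
```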
Best Answer
I would suggest you take a look at atop. When you run it interactively, you can press 'i' and then enter 1 to drop the sampling interval to one second; after that, press 'd'.
Atop highlights the resources that are being thrashed, and 'd' shows only the processes hitting the disks.