The problem was that our new RDS instance was being throttled on WriteIOPS: we could see queued disk operations (DiskQueueDepth of 10 or greater) and WriteIOPS averaging 300, whereas Amazon only provides 100 reliable IOPS (source).
The solution was to fork out roughly $200 per month for "Provisioned IOPS"; 1000 is the smallest amount you can buy (which easily covers us). The change took 5 hours to complete on 350 GB of storage (but only about 30 s of downtime, thanks to the Multi-AZ instance), and afterwards everything was fine: our MySQL 5.6 instance was running better than the original MySQL 5.5 instance, as all the literature promises.
The underlying cause was that the new RDS instance we were given when migrating to 5.6 was simply not as good as our original one, which was just bad luck. We asked Amazon support: "Why would there be a difference between MySQL 5.5 and MySQL 5.6?"
From Amazon:
Your previous underlying host may have been benefiting from unusually light use on the underlying volumes. You may have grown accustomed to very light use which allowed you to consume additional IOPs without accumulating as much DiskQueueDepth. However, when you moved to a new underlying host by upgrading, you may now be experiencing more typical use of the underlying volumes. With typical use of the underlying volumes it's not recommended to consistently exceed the amount of reliable IOPs while expecting to maintain the lower DiskQueueDepth that you've grown accustomed to. However, there is no way to confirm or deny that this was the cause.
Your question is about troubleshooting, not about 'fixing' the CPU issue.
Nearly all CPU issues can be traced to bad/missing indexes or poorly written queries.
Do you have the slowlog turned on? With long_query_time = 2? Can you run pt-query-digest? Well, probably you said no to all of those. Pursue them for future use.
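For reference, here is a minimal sketch of turning those on from a MySQL session (on RDS these settings normally live in the DB parameter group, so treat SET GLOBAL as illustrative; the log path is a placeholder):

-- Log any statement that takes longer than 2 seconds.
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 2;
-- Later, summarize the log from a shell with:  pt-query-digest /path/to/slow.log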
Can you do SHOW FULL PROCESSLIST? Probably. Look at the queries that don't say "Sleep" and don't look like they have something to do with replication. They will probably be the naughty queries.
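If you prefer a filtered view, a sketch like this against information_schema (standard columns; adjust to taste) hides the sleeping and replication threads:

-- Show only active, non-replication threads, longest-running first.
SELECT id, user, db, time, state, LEFT(info, 120) AS query_snippet
FROM information_schema.PROCESSLIST
WHERE command NOT IN ('Sleep', 'Binlog Dump', 'Binlog Dump GTID')
ORDER BY time DESC;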
Show us the naughtiest query, together with SHOW CREATE TABLE for the table(s) that it uses. Also provide EXPLAIN SELECT ... so we can see how the query is being performed.
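For the query discussed below, that would look roughly like this (table names are taken from your SELECT; \G just makes the output easier to read):

SHOW CREATE TABLE reports\G
SHOW CREATE TABLE domains\G
SHOW CREATE TABLE publisher\G
EXPLAIN SELECT ...\G   -- paste your actual SELECT in place of ...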
If you are lucky, we will say "add INDEX(...)" or "change the WHERE clause in this way...". If you are less lucky, the solution could be more complex.
(None of the specs you mentioned -- ram, cores, #tables, table size, etc., are important to the question. The version might be important because the solution might involve a 'new' feature.)
Since you have provided the SELECT, I can make some guesses and proceed without the rest of the data.
Keeping junk in the table and filtering it out (the ='N' and ='Y' tests) hurts performance, but I will live with that for the moment.
INDEX(rep_type, rep_status, pub_id, rep_pub_date) might make things run a lot faster. But... The speedup will only work (I think) when the IN clause has only one item.
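Adding it is a one-liner; here's a sketch (the index name is arbitrary, and I'm assuming the table is reports):

ALTER TABLE reports
    ADD INDEX idx_type_status_pub_date (rep_type, rep_status, pub_id, rep_pub_date);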
OFFSET 1200 means that 1200 rows have to be fetched, then tossed. Let's work on decreasing that overhead...
(Note that JOIN...ON is a better way to write JOINs.)
Oops, I need the PRIMARY KEY of reports; I'll assume it is rep_id.
SELECT  r2.*, d.name AS dom_name, p.pub_name
    FROM
    ( SELECT  rep_id
          FROM  reports
          WHERE  rep_type = 'N'
            AND  pub_id IN ( 3 )
            AND  rep_status = 'Y'
          ORDER BY  rep_pub_date DESC
          LIMIT  1200, 100
    ) r1
    JOIN  reports   AS r2 USING(rep_id)
    JOIN  domains   AS d  ON d.dom_id = r2.domain_id
    JOIN  publisher AS p  ON p.pub_id = r2.pub_id;
Some notes:
* The subquery looks only at reports, and does not have to haul around 1200 rows to be thrown away.
* Once the subquery is finished, the JOINs are doing only 100 rows each.
* This format will benefit from the above index.
* When the IN has multiple items, the index will work even better for the subquery than the original query. But still not 'perfectly'.
* My index works well only because it handles all of the WHERE clause, plus all of the ORDER BY.
To get even better performance for "pagination" than you can get through OFFSET, see my blog: mysql.rjweb.org. That blog also points out two bugs you possibly don't know you currently have.
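The core idea there is to remember where the previous page left off rather than counting rows with OFFSET; a rough sketch of a "next page" query (I'm guessing at rep_id as the tie-breaker, since I don't have your schema):

-- The two ? placeholders are the rep_pub_date and rep_id of the
-- last row shown on the previous page; no rows are read and thrown away.
SELECT  rep_id
    FROM  reports
    WHERE  rep_type = 'N'
      AND  pub_id IN ( 3 )
      AND  rep_status = 'Y'
      AND  (rep_pub_date, rep_id) < (?, ?)
    ORDER BY  rep_pub_date DESC, rep_id DESC
    LIMIT  100;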
Another issue... Do users really go as far as the 13th page? Or is this a search engine crawler causing you grief?
Best Answer
Since your commits are stalling, it's pretty safe to assume you're running with innodb_flush_log_at_trx_commit=1 (see #4 here), so one obvious change you could make is to switch to innodb_flush_log_at_trx_commit=0; however, that may not be the right choice for you - see the previous link.
In any case, COMMITs don't usually stall due to other activity on the server (like queries), unless that other activity locks some kind of mutex for far too long.
If I had to guess without any further evidence, I would say that your writes are so intensive that MySQL is unable to flush fast enough using adaptive checkpointing, so it exceeds 75% of the log capacity and starts sharp checkpointing (which blocks server activity for a short period).
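If you want to confirm what you are running with before changing anything, a quick check (the SET GLOBAL line is only a sketch - on RDS this is normally changed through the parameter group):

-- See the current durability setting.
SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- 0 = flush the redo log roughly once per second instead of on every COMMIT
-- (you can lose up to ~1 second of transactions if mysqld crashes).
SET GLOBAL innodb_flush_log_at_trx_commit = 0;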
There are a few quick ways to make MySQL handle writes better (besides the one already mentioned above):
Otherwise, it would be really useful to see disk activity from AWS and the amount of data the server currently writes (you can calculate it from the Log sequence number delta between two consecutive SHOW ENGINE INNODB STATUS\G runs).
There are other ways too, but for that I'd need a lot more details on the actual workload.
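A sketch of that calculation (only the method matters; the interval is arbitrary):

-- Run this, wait 60 seconds, then run it again:
SHOW ENGINE INNODB STATUS\G
-- In each output, find the "Log sequence number" line in the LOG section.
-- Bytes written per second = (second LSN - first LSN) / 60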