I'm trying to diagnose the bottleneck behind high write I/O activity on a Percona master (replicating to a slave). The pattern of behaviour is roughly cyclic and regular: 60 minutes of 75-90% write-based I/O utilisation followed by 10 minutes of ~25%. Occasionally this spikes to the point that the fronting webservice is unresponsive for a couple of minutes.
Doing a SHOW PROCESSLIST shows the replication thread and then a number of idle threads (though occasionally there are 1-2 COMMITs, which complete by the time I reissue the command), but nothing indicating a specific active query.
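Statements that finish between PROCESSLIST samples can still be caught by logging everything for a short window and then aggregating. A sketch, assuming percona-toolkit's pt-query-digest is installed (long_query_time = 0 logs every statement, so keep the window brief and watch disk space):

```shell
# Log all statements for ~60s, then restore the threshold and aggregate.
# Note: the GLOBAL value only applies to connections opened after the change.
mysql -e "SET GLOBAL long_query_time = 0"
sleep 60
mysql -e "SET GLOBAL long_query_time = 1"
pt-query-digest /var/log/mysql/mysql-slow.log
```

The digest ranks query fingerprints by total time, which usually points straight at whichever short-but-frequent writes are generating the I/O.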
InnoDB buffer pool status variables:
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%';
+-----------------------------------------+-----------+
| Variable_name                           | Value     |
+-----------------------------------------+-----------+
| Innodb_buffer_pool_pages_data           | 1678714   |
| Innodb_buffer_pool_pages_dirty          | 718       |
| Innodb_buffer_pool_pages_flushed        | 195847729 |
| Innodb_buffer_pool_pages_LRU_flushed    | 0         |
| Innodb_buffer_pool_pages_free           | 5299854   |
| Innodb_buffer_pool_pages_made_not_young | 0         |
| Innodb_buffer_pool_pages_made_young     | 59369     |
| Innodb_buffer_pool_pages_misc           | 99319     |
| Innodb_buffer_pool_pages_old            | 619661    |
| Innodb_buffer_pool_pages_total          | 7077887   |
+-----------------------------------------+-----------+
10 rows in set (0.00 sec)
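A quick sanity check on those counters (a sketch assuming InnoDB's default 16 KiB page size, since innodb_page_size isn't set in the config below):

```shell
# Convert the page counters above into sizes (integer GiB, truncated).
page_kib=16
data_pages=1678714     # Innodb_buffer_pool_pages_data
free_pages=5299854     # Innodb_buffer_pool_pages_free
total_pages=7077887    # Innodb_buffer_pool_pages_total

data_gib=$(( data_pages * page_kib / 1024 / 1024 ))
free_gib=$(( free_pages * page_kib / 1024 / 1024 ))
total_gib=$(( total_pages * page_kib / 1024 / 1024 ))

echo "data in pool:  ~${data_gib} GiB"
echo "never used:    ~${free_gib} GiB"
echo "pool total:    ~${total_gib} GiB"
```

So the ~108G pool holds only ~25 GiB of data, ~80 GiB of it has never been used, and at this snapshot just 718 pages are dirty; together with the 1000/1000 hit rate below, that suggests the working set fits comfortably in memory and the write I/O is coming from logging and flushing rather than buffer pool pressure.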
SHOW ENGINE INNODB STATUS gives:
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
The mysqld section from my.cnf is
[mysqld]
user = mysql
default-storage-engine = InnoDB
socket = /var/run/mysqld/mysqld.sock
pid-file = /var/run/mysqld/mysqld.pid
key-buffer-size = 32M
myisam-recover = FORCE,BACKUP
max-allowed-packet = 16M
max-connect-errors = 1000000
innodb = FORCE
datadir = /srv/mysql/
server_id = 30
log-bin = /srv/mysql/mysql-bin
expire-logs-days = 14
sync-binlog = 1
tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0
max-connections = 500
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 1024
table-open-cache = 2048
tmpdir = /tmp
innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 512M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table = 1
innodb-buffer-pool-size = 108G
log-error = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes = 1
slow-query-log = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log
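One thing worth flagging in the config above: `sync-binlog = 1` together with `innodb-flush-log-at-trx-commit = 1` forces at least two fsyncs per commit, and both logs share the same RAID1 spindles as the data. If the slave plus backups make losing the last few binlog events on a crash acceptable, a common relaxation is (a sketch of the trade-off, not a recommendation for every workload):

```ini
[mysqld]
sync-binlog = 0                        # let the OS schedule binlog flushes
# innodb-flush-log-at-trx-commit = 2   # weaker still: fsync the redo log ~once a second
```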
The slow query log has a number of entries but none of them are long-lived. A sample extract is:
use xxxxxxxxx;
SET timestamp=1411484104;
SELECT amnews_article.aid, amnews_article.sid_id, amnews_article.fid_id,
       amnews_article.url, amnews_article.url_hash, amnews_article.title,
       amnews_article.sha1, amnews_article.date_ts, amnews_article.summary,
       amnews_article.author_feed, amnews_article.author, amnews_article.language,
       amnews_article.fetched, amnews_article.processed, amnews_article.ngrams,
       amnews_article.indexed_ngram, amnews_article.topics_ngram,
       amnews_article.journals, amnews_article.persons, amnews_article.links,
       amnews_article.dois, amnews_article.verified, amnews_article.journals_version,
       amnews_article.duplicate_of_id, amnews_article.external, amnews_article.blog,
       amnews_article.last_updated, amnews_article.grouped
FROM amnews_article
ORDER BY amnews_article.aid DESC
LIMIT 1;
As an aside, I'm not sure why that query makes it into the slow query log, because it runs quickly and uses an index, as shown by EXPLAIN:
+----+-------------+----------------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table          | type  | possible_keys | key     | key_len | ref  | rows | Extra |
+----+-------------+----------------+-------+---------------+---------+---------+------+------+-------+
|  1 | SIMPLE      | amnews_article | index | NULL          | PRIMARY | 4       | NULL |    1 |       |
+----+-------------+----------------+-------+---------------+---------+---------+------+------+-------+
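(A likely explanation for the aside: with `log-queries-not-using-indexes` enabled, MySQL also logs statements whose plan is a full index scan — this EXPLAIN shows `type: index` with no `possible_keys` — even though `LIMIT 1` stops the scan almost immediately. If that noise is unwanted, the my.cnf fragment can be narrowed, e.g.:)

```ini
[mysqld]
log-queries-not-using-indexes = 0   # keep only queries exceeding long_query_time
```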
Underlying this, the disks are spinning platters configured as an ext4 LVM partition on RAID1, and the machine is physical (albeit managed) rather than a VM. The data partition is on the same disks (LVM volume group) as the log partition. RAM usage is ~90% of 128G, and swap is available but not being used.
$ free -m
             total       used       free     shared    buffers     cached
Mem:        128912     117370      11542          0        516      36092
-/+ buffers/cache:      80761      48151
Swap:         2045          0       2045
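Cross-checking `free -m` against the buffer pool counters (a sketch; the figure is copied from the `-/+ buffers/cache` line above):

```shell
# Resident (non-cache) memory reported by free -m.
used_mib=80761
used_gib=$(( used_mib / 1024 ))
echo "~${used_gib} GiB resident"
```

~78 GiB resident is well under the configured 108G buffer pool, which matches the large `Innodb_buffer_pool_pages_free` value earlier: the pool is sized at 108G but much of it has never been touched.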
Version is 5.5.36-34.2-648.precise-log
Edit to add info asked for by Vérace:

Is your web server on the same box?
The webserver is on a different box from the db.

Are there cron jobs?
There are db-related cronjobs, but they run from various different hosts. (With queries not showing in the process list, though, I don't know where to direct attention.)

How are you determining that there's an I/O bottleneck?
We use New Relic for server and service monitoring. It produces graphs showing the long periods of high I/O usage, flatlining at 100MB/s. iotop -ao over ~10 mins shows that the processes responsible are mysql. (Well, jbd is at the very top, but presumably that's expected with the disk setup.)
My questions are twofold:
What is mysql doing writing so much data if nothing is shown in the process list (i.e. how can I find out which queries are responsible)?
Are there any configuration changes I can make to alleviate the almost-perpetual high IO?
Best Answer
This sounds like poor flushing behaviour: ideally you'd be seeing consistent I/O, not stretches of 90% and 25%.
This might be a big help:
http://www.percona.com/blog/2013/12/03/innodb_stats_on_metadata-slow-queries-information_schema/
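For reference, a sketch of the knobs usually examined for bursty flushing on Percona Server 5.5, plus the variable the linked post is about (values illustrative, not prescriptive — measure before and after changing anything):

```ini
[mysqld]
innodb-io-capacity       = 200   # keep near what RAID1 spinning disks actually sustain
innodb-adaptive-flushing = 1     # spread dirty-page writes instead of checkpoint bursts
innodb-stats-on-metadata = 0     # per the linked post: avoid index-stat rebuilds on
                                 # information_schema/SHOW queries
```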