I have been working on tuning a MySQL database for Zabbix for quite some time now. We do not actually have a DBA to assist with this.
We have a Zabbix server running 16k values per second.
Our database is MariaDB 5.5.38 running Galera replication with only two nodes. I will be happy to provide any information needed to assist in tuning.
The issues we are seeing are that our Zabbix graphs are delayed and our dbsyncers are maxed out. According to the Zabbix performance tuning book and Zabbix support personnel, our database is not able to keep up with the amount of data being sent its way.
We have 512GB of RAM on the server, SSDs in RAID 10, and 32 cores at 2.59GHz.
The MariaDB, Zabbix, and Apache services are all running on this server.
What can I do to increase the performance of this server? The OS does not seem to be over-utilized based on iostat, vmstat, and other OS performance-checking tools.
Thanks in advance.
Best Answer
"running 16k values per second" -- Do you mean "rows inserted per second"?
SELECTs
per second? Something else?The most important setting for MySQL is
innodb_buffer_pool_size
, which should probably be 300G for your setup.innodb_buffer_pool_instances=16
.innodb_flush_log_at_trx_commit
is 1 by default. 1 is safer; 2 leads to less I/O.If you are doing 16K
INSERTs
per second, there are several techniques to decrease the I/O (which you imply is 'too high'). Batching (doing multiple rows in a singleINSERT
), putting them a single transaction, usingLOAD DATA
, or "staging". Please describe things more.VARIABLES/STATUS analysis
VARIABLES/STATUS analysis

There were some interesting things. Here are the highlights:
- `( innodb_max_dirty_pages_pct ) = 30` -- When the buffer_pool starts flushing to disk -- Why so low? (Probably does not matter.)
- `( Innodb_buffer_pool_reads + Innodb_buffer_pool_pages_flushed ) = (23,406,029 + 990,332,255) / 769924 = 1316 /sec` -- InnoDB I/O -- Increase innodb_buffer_pool_size? Increase innodb_max_dirty_pages_pct?
- `( Created_tmp_tables ) = 300,386,596 / 769924 = 390 /sec` -- Frequency of creating "temp" tables as part of complex SELECTs.
- `( Created_tmp_disk_tables ) = 16,281,268 / 769924 = 21 /sec` -- Frequency of creating disk "temp" tables as part of complex SELECTs -- Increase tmp_table_size and max_heap_table_size. Check the rules for temp tables on when MEMORY is used instead of MyISAM. Perhaps minor schema or query changes can avoid MyISAM; better indexes and reformulation of queries are more likely to help.
- `( Select_scan ) = 22,715,682 / 769924 = 30 /sec` -- Full table scans -- Add indexes / optimize queries (unless they are tiny tables).
- `( Sort_merge_passes ) = 412,662 / 769924 = 0.54 /sec` -- Hefty sorts -- Increase sort_buffer_size and/or optimize complex queries.
- `( long_query_time ) = 10.000000 = 10` -- Cutoff (seconds) for defining a "slow" query -- Suggest 1 (and turn on the slowlog); this would help in locating naughty queries.
- `Threads_running = 45` -- Likely to indicate that queries are stumbling over each other.
- `Innodb_buffer_pool_pages_flushed / Questions = 0.49` -- Perhaps indicates the buffer_pool is not being an efficient cache.
- `Select_range / Com_select = 22%` -- High, but not necessarily bad. (`Select_range = 360/sec`.)
- `Opened_files = 86/sec` -- I'm concerned about `table_open_cache`, but there are no metrics to indicate whether it is "big enough".

Can you group writes into bigger transactions? There is a lot of I/O; this would decrease it. `Com_begin` and `Com_commit` are currently 257/sec.

A lot of the writes are `UPDATE`s. Please elaborate on what they are doing. If they are incrementing counters, it may be worthwhile to collect them in some way.

- `innodb_io_capacity = 6000` -- I do not know if this is optimal for your system.
- `Innodb_x_lock_spin_waits = 5500/sec` -- A huge number, but I don't have a good handle on what to do about it.
- `large_page_size` -- I have not experienced this; how is it going for you?
- `slave_skip_errors = 1062` -- Maybe you need to use `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE`?
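A sketch of those two alternatives to skipping duplicate-key (1062) errors with `slave_skip_errors`, using a hypothetical table `t` with a unique key on `id`:

```sql
-- Silently skip rows that would collide with an existing unique key:
INSERT IGNORE INTO t (id, val) VALUES (1, 'x');

-- Or overwrite the existing row on collision:
INSERT INTO t (id, val) VALUES (1, 'x')
  ON DUPLICATE KEY UPDATE val = VALUES(val);
```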
Slow log
Many of the queries have
Which needs these composite indexes:
Might as well toss KEY hosts_3, since it will be redundant.
I see cases of redundant indexes, such as
Recommend `DROP`ping one of them.

There are issues with many of the other tables, but I won't go into them; Zabbix should fix them.
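The usual shape of this redundancy, with hypothetical index and column names (the actual Zabbix indexes are not shown above): an index on a leading prefix of another index adds write cost without helping reads.

```sql
-- Suppose these two indexes exist; idx_host is a left-prefix of idx_host_clock,
-- so any query that could use idx_host can use idx_host_clock instead.
-- CREATE INDEX idx_host       ON t (hostid);
-- CREATE INDEX idx_host_clock ON t (hostid, clock);

ALTER TABLE t DROP INDEX idx_host;   -- drop the redundant one
```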
more
This is probably very inefficient because of the `OR`. The best solution (for MySQL; don't know about Zabbix) is to split it into two `UPDATE`s:
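A generic sketch of that split (the real Zabbix query is not shown above, so the table and column names here are made up):

```sql
-- One UPDATE with an OR'd WHERE clause often cannot use either index well:
UPDATE t SET flag = 1 WHERE a = 10 OR b = 20;

-- Two UPDATEs, each able to use a single index:
UPDATE t SET flag = 1 WHERE a = 10;
UPDATE t SET flag = 1 WHERE b = 20 AND a <> 10;  -- skip rows the first one handled
```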
more 2
It should not take 6-10 seconds to insert 1000(?) rows into one partition of a table. It might help if you limited these inserts to 100 rows at a time.
more 3
Speeding up queries, as indicated above, will help regardless. But I am now seeing a different problem. Was that slowlog snippet from one 30-minute period? And was that some kind of spike that does not normally happen? Was this when you reached `Max_used_connections = 523`? What happens in a case like that: 523 threads, all getting some tiny fraction of 32 cores, etc., stumbling over each other. I see a few queries taking more than 3 seconds, yet they should not take more than about 3 milliseconds.

What I recommend in such situations is to figure out how to limit the number of connections upstream, that is, in the Zabbix client. Meanwhile, you could decrease `max_connections` to, say, 200. This is likely to trip up the client, but that might be better than having a long period when nothing seems to be getting done.

Also, 5.5 is getting pretty old; start laying plans for an upgrade.
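Both stop-gaps suggested above (capping connections, plus the earlier suggestion of a lower slow-query cutoff) can be applied at runtime; note that `SET GLOBAL` changes revert when mysqld restarts unless they are also placed in my.cnf:

```sql
-- Cap concurrent connections so threads stop trampling each other:
SET GLOBAL max_connections = 200;

-- Lower the "slow" threshold and log offenders:
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log = ON;
```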
Nasty DELETE
@12:37 (and perhaps every xx:37) there is a 123-second query finishing:
Although this is the optimal index for such a `DELETE`, it is apparently not good enough. 123 seconds of disk-intensive deletes, including building rollback stuff, can slow down other things -- which I think is what happened.
Cures?
Plan A: Instead of doing the `DELETE` hourly, do it every, say, 5 minutes (with a different `clock` cutoff each time). This should speed it up to 10 seconds instead of 123, possibly faster.

Plan B: Make it a `PARTITION`ed table and use `DROP PARTITION`, which is much less intensive on the system. (But that adds a lot of complexity, and Zabbix probably will get in the way.)

Plan C: "Continually" delete the stuff, using another process. Perhaps
Plan D: See my Big Delete blog for other techniques (and some details on Plans A, B, C.)
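Plans A/C amount to chunked deletes driven by an external scheduler. A sketch with illustrative Zabbix-like names (`history`, `clock`); verify the retention window before running anything like this:

```sql
-- Delete in small chunks so no single statement builds a huge rollback image:
DELETE FROM history
 WHERE clock < UNIX_TIMESTAMP(NOW() - INTERVAL 7 DAY)
 LIMIT 1000;
-- Repeat (from a cron job or a loop) until fewer than 1000 rows are affected.
```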
Continue to pursue the other suggestions; you are probably near the breaking point -- that is, if the dataset doubles in size, nothing will bring the system back to life. My various suggestions will delay, but not eliminate, the doomsday.