MySQL – Performance tuning large MySQL databases for Zabbix

innodb, mariadb, mysql, performance, tuning

I have been working on tuning a MySQL database for Zabbix for quite some time now. We do not actually have a DBA to assist with this.

We have a Zabbix server running 16k values per second.
Our database is MariaDB 5.5.38 running Galera replication with only two nodes. I will be happy to provide any information needed to assist in tuning.

The issues we are seeing are that our Zabbix graphs are delayed and our dbsyncers are maxed out. According to the Zabbix performance tuning book and Zabbix support personnel, our database is not able to keep up with the amount of data being sent its way.

The server has 512GB of RAM, SSDs in RAID 10, and 32 cores at 2.59GHz.

The MariaDB, Zabbix, and Apache services are all running on this server.

What can I do to increase the performance of this server? Based on iostat, vmstat, and other OS performance tools, the OS does not appear to be over-utilized.

Thanks in advance.

Best Answer

"running 16k values per second" -- Do you mean "rows inserted per second"? SELECTs per second? Something else?

The most important setting for MySQL is innodb_buffer_pool_size, which should probably be 300G for your setup. innodb_buffer_pool_instances=16.

innodb_flush_log_at_trx_commit is 1 by default. 1 is safer; 2 leads to less I/O.
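
As a rough sketch (300G and 16 are starting points, not tuned numbers): on MariaDB 5.5 the two buffer-pool settings are not dynamic, so they go in my.cnf and take effect after a restart, while the flush setting can be tried on the fly:

    -- my.cnf, [mysqld] section (requires a restart on MariaDB 5.5):
    --   innodb_buffer_pool_size      = 300G
    --   innodb_buffer_pool_instances = 16
    -- Dynamic; can be changed at runtime to measure the I/O difference:
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;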

If you are doing 16K INSERTs per second, there are several techniques to decrease the I/O (which you imply is 'too high'): batching (multiple rows in a single INSERT), putting several INSERTs in a single transaction, using LOAD DATA, or "staging". Please describe things more.
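
As an illustration only (the table and values here are made up, loosely shaped like a Zabbix history table), batching plus a single transaction looks like this:

    -- One multi-row INSERT inside one transaction replaces hundreds of
    -- single-row INSERTs, each with its own commit (and its own log flush).
    START TRANSACTION;
    INSERT INTO history (itemid, clock, value)
    VALUES
        (1001, 1414000000, 1.23),
        (1002, 1414000000, 4.56),
        (1003, 1414000000, 7.89);   -- ... a few hundred rows per statement
    COMMIT;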

VARIABLES/STATUS analysis

There were some interesting things. Here are the highlights:

( innodb_max_dirty_pages_pct ) = 30 -- When buffer_pool starts flushing to disk -- Why so low? (Probably does not matter)

( (Innodb_buffer_pool_reads + Innodb_buffer_pool_pages_flushed) ) = ((23406029 + 990332255) ) / 769924 = 1316 /sec -- InnoDB I/O -- Increase innodb_buffer_pool_size? Increase innodb_max_dirty_pages_pct?

( Created_tmp_tables ) = 300,386,596 / 769924 = 390 /sec -- Frequency of creating "temp" tables as part of complex SELECTs.

( Created_tmp_disk_tables ) = 16,281,268 / 769924 = 21 /sec -- Frequency of creating disk "temp" tables as part of complex SELECTs -- increase tmp_table_size and max_heap_table_size. Check the rules for temp tables on when MEMORY is used instead of MyISAM. Perhaps minor schema or query changes can avoid MyISAM. Better indexes and reformulation of queries are more likely to help.
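
Both variables are dynamic, so they can be raised without a restart; the 256M below is only a guess to experiment with, not a tuned value (the smaller of the two is what actually limits in-memory temp tables):

    SET GLOBAL tmp_table_size      = 256 * 1024 * 1024;
    SET GLOBAL max_heap_table_size = 256 * 1024 * 1024;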

( Select_scan ) = 22,715,682 / 769924 = 30 /sec -- full table scans -- Add indexes / optimize queries (unless they are tiny tables)

( Sort_merge_passes ) = 412,662 / 769924 = 0.54 /sec -- Hefty sorts -- Increase sort_buffer_size and/or optimize complex queries.

( long_query_time ) = 10.000000 = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 1 (and turn on slowlog) -- would help in locating naughty queries.
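
Both settings are dynamic; something like this would do it (the new long_query_time applies to connections opened after the change):

    SET GLOBAL slow_query_log  = 'ON';
    SET GLOBAL long_query_time = 1;    -- seconds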

Threads_running = 45 -- likely to indicate that queries are stumbling over each other

Innodb_buffer_pool_pages_flushed / Questions = 0.49 -- Perhaps indicates buffer_pool is not being an efficient cache.

Select_range / Com_select = 22% -- High, but not necessarily bad. (Select_range = 360/sec)

Opened_files = 86/sec. I'm concerned about the table_open_cache, but there are no metrics to indicate whether it is "big enough".

Can you group writes into bigger transactions? There is a lot of I/O; this would decrease it. Com_begin and Com_commit are currently 257/sec.

A lot of the Writes are UPDATEs. Please elaborate on what they are doing. If they are incrementing counters, it may be worthwhile to collect them in some way.
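
If they really are counter increments, one hedged sketch (the table and column names here are invented) is to accumulate the deltas on the client and apply them in one statement per counter:

    -- Instead of 37 separate "SET hits = hits + 1" UPDATEs, apply the sum once:
    UPDATE item_counters
    SET    hits = hits + 37
    WHERE  itemid = 1001;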

innodb_io_capacity = 6000 -- I do not know if this is optimal for your system.

Innodb_x_lock_spin_waits = 5500/sec -- A huge number, but I don't have a good handle on what to do about it.

large_page_size -- I have no experience with this; how is it going for you?

slave_skip_errors = 1062 -- Maybe you need to use INSERT IGNORE or INSERT .. ON DUPLICATE KEY UPDATE..?
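
A sketch of the two forms, using a made-up table t(id PRIMARY KEY, val); either one avoids raising the 1062 duplicate-key error in the first place:

    -- Silently skip rows whose key already exists:
    INSERT IGNORE INTO t (id, val) VALUES (1, 'x');
    -- Or update the existing row instead:
    INSERT INTO t (id, val) VALUES (1, 'x')
        ON DUPLICATE KEY UPDATE val = VALUES(val);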

Slow log

Many of the queries have

    from  items t,hosts r
    where  t.hostid=r.hostid
      and  r.proxy_hostid=13242
      and  r.status in (0,1)
      and  t.type in (0,7,1,4,6,12,2,3,9,10,11,13,14,16,17,5)
    order by  t.itemid

Which needs these composite indexes:

items:  INDEX(hostid, type)  -- in this order
hosts:  INDEX(proxy_hostid, status, hostid)  -- this order; helps WHERE; 'covering'

Might as well toss KEY hosts_3, since it will be redundant.
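
Concretely, something like the following (the index names are arbitrary, and on 5.5 each ALTER may lock the table, so run them in a quiet window):

    ALTER TABLE items ADD INDEX items_hostid_type (hostid, type);
    ALTER TABLE hosts ADD INDEX hosts_proxy_status (proxy_hostid, status, hostid);
    -- Assuming hosts_3 is the single-column proxy_hostid index, the new
    -- composite index makes it redundant:
    ALTER TABLE hosts DROP INDEX hosts_3;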

I see cases of redundant indexes, such as

KEY `c_items_4` (`interfaceid`),
KEY `items_6` (`interfaceid`),

Recommend DROPping one of them.
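
For example (first check whether either name is tied to a FOREIGN KEY constraint, and drop the one that is not):

    ALTER TABLE items DROP INDEX items_6;   -- keeps c_items_4 on interfaceid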

There are issues with many of the other tables, but I won't go into them; Zabbix should fix them.

more

UPDATE item_discovery ... 
    WHERE itemid BETWEEN ... AND ...
       OR itemid IN (...);

This is probably very inefficient because of the OR. The best solution (for MySQL; don't know about Zabbix) is to split it into two UPDATEs:

UPDATE item_discovery ... 
    WHERE itemid BETWEEN ... AND ...;
UPDATE item_discovery ... 
    WHERE itemid IN (...);

more 2

It should not take 6-10 seconds to insert 1000(?) rows into one partition of a table. It might help if you limited these inserts to 100 rows at a time.

more 3

Speeding up queries, as indicated above, will help regardless. But I am now seeing a different problem. Was that slowlog snippet from one 30-minute period? And was that some kind of spike that does not normally happen? Was this when you reached Max_used_connections = 523? What happens in a case like that is that 523 threads, each getting some tiny fraction of the 32 cores, stumble over each other. I see a few queries taking more than 3 seconds, yet they should not take more than about 3 milliseconds.

What I recommend in such situations is to figure out how to limit the number of connections upstream, that is, in the Zabbix client. Meanwhile, you could decrease max_connections to, say, 200. This is likely to trip up the client, but that might be better than having a long period when nothing seems to be getting done.
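
max_connections is dynamic, so it can be lowered (and reverted) without a restart:

    SET GLOBAL max_connections = 200;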

Also, 5.5 is getting pretty old; start laying plans for an upgrade.

Nasty DELETE

@12:37 (and perhaps every xx:37) there is a 123-second query finishing:

DELETE FROM events WHERE source = 0 AND object = 0 AND clock < ...;

Although this is the optimal index for such a DELETE:

KEY `events_2` (`source`,`object`,`clock`)

It is apparently not good enough. 123 seconds of disk-intensive deletes, including building rollback stuff, can slow down other things -- which I think is what happened.

Cures?

Plan A: Instead of doing the DELETE hourly, do it every, say, 5 minutes (with a different clock cutoff each time). This should speed it up to 10 seconds instead of 123, possibly faster.
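
For example, run something like this every 5 minutes; the 30-day cutoff is only a placeholder for whatever retention Zabbix is configured for, and it assumes clock is a Unix timestamp, as the slow query suggests:

    DELETE FROM events
    WHERE  source = 0
      AND  object = 0
      AND  clock  < UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);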

Plan B: Make it a PARTITIONed table and use DROP PARTITION, which is much less intensive on the system. (But that adds a lot of complexity, and Zabbix probably will get in the way.)
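
A heavily hedged sketch of what that could look like; the daily boundaries, the primary-key change, and whether Zabbix tolerates any of this are all assumptions to verify first:

    -- events is assumed to have PRIMARY KEY (eventid); MySQL requires the
    -- partitioning column to appear in every unique key, so the PK would have
    -- to be widened first:
    ALTER TABLE events DROP PRIMARY KEY, ADD PRIMARY KEY (eventid, clock);
    ALTER TABLE events PARTITION BY RANGE (clock) (
        PARTITION p2014_10_26 VALUES LESS THAN (UNIX_TIMESTAMP('2014-10-27 00:00:00')),
        PARTITION p2014_10_27 VALUES LESS THAN (UNIX_TIMESTAMP('2014-10-28 00:00:00'))
    );
    -- New partitions must be added ahead of time; purging a day then becomes
    -- (note this removes *every* row in that range, not just source=0/object=0):
    ALTER TABLE events DROP PARTITION p2014_10_26;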

Plan C: "Continually" delete the stuff, using another process. Perhaps

Loop:
    DELETE FROM events WHERE source = 0 AND object = 0
            AND clock < ...   -- update this cutoff as needed
        LIMIT 1000;
    sleep 1 second
repeat

Plan D: See my Big Delete blog for other techniques (and some details on Plans A, B, and C).

Continue to pursue the other suggestions; you are probably near the breaking point -- that is, if the dataset doubles in size, nothing will bring the system back to life. My various suggestions will delay, but not eliminate, the doomsday.