Mysql – Intermittently Slow Query @ MySQL

MySQLperformancequery-performance

Problem statement

In a nutshell, the problem is that a query which has been smoothly running for months has started behaving weird. Particularly, the exact same query runs fast most of the times, but occasionally it runs slow, or does not manage to complete at all within the allowed 20 mins time window.

Request to the community

Provide feedback on what we have been trying to do so far to understand the root causes of the issue, and share thoughts on what can be done differently.

Problem in details

In what follows I try to provide some more details about the problem.

Environment

We have a MySQL (5.6.31) server which acts as master. The master is replicated on 3 machines.
The problem is observed against all of them.

The query

The query goes like this:

SELECT t1.Id, 
 t1.Value - COALESCE(SUM(t2.SomeValue),0),  
 [A BUNCH OF COLUMNS] 
FROM schema1.table1 t1 
 LEFT JOIN schema2.table2 t2 ON t1.Id=t2.Table1Id 
GROUP BY t1.Id 
ORDER BY t1.Id;

The tables involved are defined like this:

USE schema1; 
CREATE TABLE `table1` ( 
`Id` int(11) NOT NULL, 
`Value` double NOT NULL DEFAULT '0', 
[A BUNCH OF COLUMNS] 
PRIMARY KEY (`Id`) 
) ENGINE=InnoDB; 

USE schema2; 
CREATE TABLE `table2` ( 
`Id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, 
`Table1Id` int(11) NOT NULL, 
`SomeValue` double NOT NULL, 
PRIMARY KEY (`Id`), 
KEY `Table1Id` (`Table1Id`) 
) ENGINE=InnoDB;

The table t1 has approximately 2 million rows, t2 is a smaller table with 0-100000 rows depending on status. The smaller table is periodically cleaned by a range of data being deleted (ordinary delete, no truncation or rotation).

We run EXPLAIN on the query and the output is:

MySQL configuration

MySQL is configured like this

datadir=... 
port=... 
server-id=... 
log-bin=... 
binlog-format=MIXED 
relay-log=... 
relay-log-index=... 
replicate-ignore-db=... 
expire_logs_days=3 
socket=... 
log-slave-updates=off 

long_query_time         = 1000 
slow_query_log          = 1 
slow_query_log_file     = ... 

# Disabling symbolic-links is recommended to prevent assorted security risks 
symbolic-links=0 

# Recommended in standard MySQL setup 
#sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 

innodb_additional_mem_pool_size = 83886080 
innodb_buffer_pool_size = 10G 
innodb_buffer_pool_instances    = 10 
innodb_data_home_dir    = ... 
innodb_file_per_table   = 1 
innodb_log_buffer_size  = 128M 
innodb_log_file_size    = 512M 
key_buffer_size         = 16M 
max_allowed_packet      = 96M 
max_connections         = 1024 

open_files_limit        = 10240 

query_cache_type        = 0 
query_cache_size        = 2M 
query_cache_limit       = 128k 

join_buffer_size        = 2M 
sort_buffer_size        = 48M 
read_buffer_size        = 24M 
read_rnd_buffer_size    = 24M 

table_open_cache        = 5000 

innodb_read_io_threads  = 64 
thread_cache_size       = 20 
tmp_table_size          = 2G 
max_heap_table_size     = 2G 
tmpdir      = ... 
lower_case_table_names  = 1 
federated 
max_connect_errors      = 10000 

[mysqld_safe] 
log-error=... 
pid-file=...

We ran mysqltuner, the output is

-------- Storage Engine Statistics ------------------------------------------- 
[--] Status: +ARCHIVE +BLACKHOLE +CSV +FEDERATED +InnoDB +MRG_MYISAM 
[--] Data in MyISAM tables: 1K (Tables: 1) 
[--] Data in InnoDB tables: 6G (Tables: 122) 
[!!] Total fragmented tables: 46 

-------- Performance Metrics ------------------------------------------------- 
[--] Up for: 2d 21h 31m 53s (277K q [1.109 qps], 29K conn, TX: 688B, RX: 103B) 
[--] Reads / Writes: 2% / 98% 
[--] Binary logging is enabled (GTID MODE: OFF) 
[--] Total buffers: 102.2G global + 106.2M per thread (1024 max threads) 
[OK] Maximum reached memory usage: 104.6G (41.57% of installed RAM) 
[OK] Maximum possible memory usage: 208.5G (82.84% of installed RAM) 
[OK] Slow queries: 0% (39/277K) 
[OK] Highest usage of available connections: 2% (23/1024) 
[OK] Aborted connections: 0.06%  (18/29489) 
[!!] Query cache is disabled 
[!!] Sorts requiring temporary tables: 11% (1K temp sorts / 13K sorts) 
[!!] Joins performed without indexes: 4139 
[OK] Temporary tables created on disk: 2% (857 on disk / 30K total) 
[OK] Thread cache hit rate: 99% (23 created / 29K connections) 
[OK] Table cache hit rate: 40% (380 open / 948 opened) 
[OK] Open file limit used: 0% (59/11K) 
[OK] Table locks acquired immediately: 100% (6M immediate / 6M locks) 
[OK] Binlog cache memory access: 0% ( 0 Memory / 0 Total) 

-------- MyISAM Metrics ----------------------------------------------------- 
[!!] Key buffer used: 18.3% (3M used / 16M cache) 
[OK] Key buffer size / total MyISAM indexes: 16.0M/2.4M 
[OK] Read Key buffer hit rate: 97.5% (118 cached / 3 reads) 

-------- InnoDB Metrics ----------------------------------------------------- 
[--] InnoDB is enabled. 
[OK] InnoDB buffer pool / data size: 100.0G/6.9G 
[OK] InnoDB buffer pool instances: 64 
[!!] InnoDB Used buffer: 16.13% (1057330 used/ 6553536 total) 
[OK] InnoDB Read buffer efficiency: 100.00% (63579847841 hits/ 63580059472 total) 
[!!] InnoDB Write buffer efficiency: 0.00% (0 hits/ 1 total) 
[OK] InnoDB log waits: 0.00% (0 waits / 414342 writes) 

-------- AriaDB Metrics ----------------------------------------------------- 
[--] AriaDB is disabled. 


-------- Recommendations ----------------------------------------------------- 
General recommendations: 
    Run OPTIMIZE TABLE to defragment tables for better performance 
    Adjust your join queries to always utilize indexes 
Variables to adjust: 
    long_query_time (<= 10) 
    query_cache_size (>= 8M) 
    sort_buffer_size (> 48M) 
    read_rnd_buffer_size (> 24M) 
    join_buffer_size (> 10.0M, or always use indexes with joins)

Server(s)

The instance that was considered the most in our experiments is running Centos 7.2, it has 256 GB RAM and 2×960 GB of disk (SSD, RAID 10). The machine does not look particularly loaded when it comes to cpu utilization, iowait or disk utilization and memory (more figures available on request).

Observations

In order to trigger the problem, we need some n>1 agents (it
certainly holds for many values of n>=3) triggering the same query
against the same machine.
The larger n, the more likely the problem is to appear
While a certain agent is experiencing the slow query, the others can experience a normal run of the very same query.

The query is not stuck, it is progressing slowly. This was noticed independently by analyzing the slow queries log
(Rows_examined>0, but Rows_sent=0) and also by gstack-ing the MySQL
process. In particular, we managed to see the thread corresponding to the query in action, and we could see the top of the stack being
progressing at all time. Some examples of the top of the stack

 #0  0x000000000082e20e in JOIN_CACHE::read_record_field (this=this@entry=0x7facb8012430, copy=copy@entry=0x7facb8012560, blob_in_rec_buff=blob_in_rec_buff@entry=false) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_join_buffer.cc:1486 
 #1  0x000000000082e422 in JOIN_CACHE::read_some_record_fields (this=this@entry=0x7facb8012430) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_join_buffer.cc:1406 
 #2  0x000000000082e512 in JOIN_CACHE::get_record (this=0x7facb8012430) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_join_buffer.cc:1289 
 #3  0x000000000082efec in JOIN_CACHE_BNL::join_matching_records (this=0x7facb8012430, skip_last=false) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_join_buffer.cc:1903 
 #4  0x000000000082ea01 in JOIN_CACHE::join_records (this=0x7facb8012430, skip_last=<optimized out>) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_join_buffer.cc:1708 
 #5  0x00000000006ce43f in evaluate_join_record (join=join@entry=0x7facb80130b0, join_tab=join_tab@entry=0x7facb8011530) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_executor.cc:1604 
 #6  0x00000000006ce7e9 in sub_select (join=0x7facb80130b0, join_tab=0x7facb8011530, end_of_records=<optimized out>) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_executor.cc:1279 
 #7  0x00000000006cdae8 in do_select (join=0x7facb80130b0) at /export/home/pb2/build/sb_0-18928004-1463439221.59/rpm/BUILD/mysql-5.6.31/mysql-5.6.31/sql/sql_executor.cc:936 
...

Debugging Attempts

For what concerns mysqltuner output, neither fragmentation nor lack of indexes seem to be causing this issue: we rebuilt table2 and didn't notice any effect, and even though it is not good to have non-indexed joins, this should not be the case for the query in question.

Beside mysqltuner suggestions, we did some experiments with MySQL setup. All of them have been based on intuitions rather than exact science, none of them helped (an arrow means subsequent values that we tried for that specific setting).

Query chache. We tried to turn it on (query_cache_size = 64M, query_cache_type = 1 and query_cache_limit = 2M) and off
innodb_buffer_pool_size=200g -> 100G -> 10G
join_buffer_size=2M -> 40M -> 2M
read_buffer_size=2M -> 20M -> 24M
read_rnd_buffer_size=4M -> 8M -> 16M -> 24M
sort_buffer_size=4M -> 12M -> 48M
innodb_buffer_pool_instances=72 -> 10

UPDATE 1
After the discussion we arranged some extra logging. Essentially, we dumped the size of t1 and t2 every 30 seconds.

All issues experienced in the first part of today happened when t2 was shrinked in size. In other words, the issue would occur between two consecutive measurements of t2's size such that:
i) the first one reports "big" (70-100 krows) values for t2's size;
ii) the second one reports very small (at most few thousands rows) values for t2'size.

Some code takes care of shrinking the table on the master via a delete from t2 where Id <= {}. The code is included in a transaction, which seems very quick at all times.

Does this ring any bell?

Best Answer

Questions:

Does "[bunch of columns]" include any from schema2? (If not, I have a possible answer.)
What is status in "0-100000 rows depending on status"?
"When the day grows" -- is that when you get closer to 100000 rows?
Are you sure that nothing else is running when this happens?
InnoDB Write buffer efficiency: 0.00% (0 hits/ 1 total) implies that the table is never written to. Could that be correct?
Is the query being done via cron?

As for mysqltuner, you are encountering some of its failings.

In all cases you set the buffer_pool_size bigger than all the data (6.9G), hence changing the size made no difference.
innodb_buffer_pool_instances makes only a small difference. I think some benchmark suggested no more than 16. Or perhaps it was related to the number of CPU cores.
Turning on the QC would benefit you only if the various 'agents' performed exactly the same query and no writes were being performed to either schema1 or schema2.
The join_buffer should not be used for this query. Note that it is not mentioned in Extra in EXPLAIN SELECT.
Defragmentation is almost never an issue with InnoDB; ignore mysqltuner's persistent call for such.

What I see (so far) in the query is InnoDB "stumbling over itself". But I rarely see such with only 3 connections.

Also, the "inflate-deflate" of "JOIN + GROUP BY" is an ugly pattern that is costly. I need more details on the query to help you unravel that.

Related Solutions

MySQL Replication: Insert generating VERY huge temporary tables

Algorithmically, it is doing what your colleague says. But, do you see what it is doing ???

It is generating 10,000 temp tables each containing 1 row after traversing 317 million rows through in the InnoDB internal index. Each temp table is a complete regeneration of the rowids in sernumbers_results_2009 table along with executing handler_read_prev commands internally to sort the data by an index scan from the back of the internal rowid index. Also, please remember you are dealing with InnoDB. Who knows what Multiversioning (via MVCC) is going on so that the INSERT is completed without interference and with rollback capabilities.

Is there any reason why this query wouldn't work for you ???

INSERT INTO sernumbers_results_2009
SELECT * FROM sernumbers_results
ORDER BY rowid DESC LIMIT 10000;

This will definitely generate one temp table.

Give it a Try !!!

Mysql – How to debug a db memory-leak causing thesql to go before it’s own limits

...even surpassing it's theorically maximum possible allocation.

[OK] Maximum possible memory usage: 7.3G (46% of installed RAM)

There is not actually a way to calculate maximum possible memory usage for MySQL, because there is no cap on the memory it can request from the system.

The calculation done by mysqltuner.pl is only an estimate, based on a formula that doesn't take into account all possible variables, because if all possible variables were taken into account, the answer would always be "infinite." It's unfortunate that it's labeled this way.

Here is my theory on what's contributing to your excessive memory usage:

thread_cache_size       = 128

Given that your max_connections is set to 200, the value of 128 for thread_cache_size seems far too high. Here's what makes me think this might be contributing to your problem:

When a thread is no longer needed, the memory allocated to it is released and returned to the system unless the thread goes back into the thread cache. In that case, the memory remains allocated.

^{http://dev.mysql.com/doc/refman/5.6/en/memory-use.html}

If your workload causes even an occasional client thread to require a large amount of memory, those threads may be holding onto that memory, then going back to the pool and sitting around, continuing to hold on to memory they don't technically "need" any more, on the premise that holding on to the memory is less costly than releasing it if you're likely to need it again.

I think it's worth a try to do the following, after first making a note of how much memory MySQL is using at the moment.

Note how many threads are currently cached:

mysql> show status like 'Threads_cached';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| Threads_cached | 9     |
+----------------+-------+
1 row in set (0.00 sec)

Next, disable the thread cache.

mysql> SET GLOBAL thread_cache_size = 0;

This disables the thread cache, but the cached threads will stay in the pool until they're used one more time. Disconnect from the server, then reconnect and repeat.

mysql> show status like 'Threads_cached';

Continue disconnecting, reconnecting, and checking until the counter reaches 0.

Then, see how much memory MySQL is holding.

You may see a decrease, possibly significant, and then again you may not. I tested this on one of my systems, which had 9 threads in the cache. Once those threads had all been cleared out of the cache, the total memory held by MySQL did decrease... not by much, but it does illustrate that threads in the cache do release at least some memory when they are destroyed.

If you see a significant decrease, you may have found your problem. If you don't, then there's one more thing that needs to happen, and how quickly it can happen depends on your environment.

If the theory holds that the other threads -- the ones currently servicing active client connections -- have significant memory allocated to them, either because of recent work in their current client session or because of work requiring a lot of memory that was done by another connection prior to them languishing in the pool, then you won't see all of the potential reduction in memory consumption until those threads are allowed to die and be destroyed. Presumably your application doesn't hold them forever, but how long it will take to know for sure whether there's a difference will depend on whether you have the option of cycling your application (dropping and reconnecting the client threads) or if you'll have to just wait for them to be dropped and reconnected over time on their own.

But... it seems like a worthwhile test. You should not see a substantial performance penalty by setting thread_cache_size to 0. Fortunately, thread_cache_size is a dynamic variable, so you can freely change it with the server running.

Best Answer

Related Solutions

MySQL Replication: Insert generating VERY huge temporary tables

Mysql – How to debug a db memory-leak causing thesql to go before it’s own limits

Related Question