Mysql – Optimize MySQL query with MAX, GROUP BY, and WHERE

MySQLmysql-5.6performancequery-performance

I have a fairly simple query that I've failed to optimize sufficiently despite adding a bunch of indexes.

The query is:

SELECT max(elapsed_seconds)
FROM sql_queries
WHERE created_at >= now() - interval 1 week
GROUP BY `sql`

I've tried adding the following indexes:

  KEY `sql_queries_sql_index` (`sql`),
  KEY `sql_queries_elapsed_seconds_index` (`elapsed_seconds`),
  KEY `sql_queries_created_at_index` (`created_at`),
  KEY `sql_queries_sql_created_at_index` (`sql`,`created_at`),
  KEY `sql_queries_sql_elapsed_seconds_index` (`sql`,`elapsed_seconds`),
  KEY `sql_queries_created_at_sql_elapsed_seconds` (`created_at`,`sql`,`elapsed_seconds`)

Obviously there's too many (and redundant) indexes — I just kept adding them hoping the query would run faster.

The table has 24 million rows and the query currently takes about four minutes.

"explain" shows:

+----+-------------+-------------+------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+---------+------+----------+----------+-----------------------------------------------------------+
| id | select_type | table       | partitions | type  | possible_keys                                                                                                                                                        | key                                        | key_len | ref  | rows     | filtered | Extra                                                     |
+----+-------------+-------------+------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+---------+------+----------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | sql_queries | NULL       | range | sql_queries_sql_index,sql_queries_created_at_index,sql_queries_sql_created_at_index,sql_queries_sql_elapsed_seconds_index,sql_queries_created_at_sql_elapsed_seconds | sql_queries_created_at_sql_elapsed_seconds | 5       | NULL | 11574092 |   100.00 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------------+------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+---------+------+----------+----------+-----------------------------------------------------------+

The column definitions are:

  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `sql` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
  `elapsed_seconds` double DEFAULT NULL,
  `created_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,

I'd like to do avg and count on each group, but I've omitted that to simplify discussions.

Any ideas / tips are greatly appreciated.

Update 1: I tried simplifying the query by hard-coding the date like this:

select max(elapsed_seconds) from sql_queries where created_at >= '2018-4-22' group by `sql`;

The query time (after doing a first query to warm up the caches) decreases from 2 min 5 sec to 1 min 53 secs. So not a significant improvement.

Update 2:

Here's the explain statements for this simplified query:

mysql> explain select max(elapsed_seconds) from sql_queries where created_at >= '2018-4-22' group by `sql`;
+----+-------------+-------------+------------+-------+-----------------------------------------------------------------------------------------------+--------------------------------------------+---------+------+----------+----------+--------------------------+
| id | select_type | table       | partitions | type  | possible_keys                                                                                 | key                                        | key_len | ref  | rows     | filtered | Extra                    |
+----+-------------+-------------+------------+-------+-----------------------------------------------------------------------------------------------+--------------------------------------------+---------+------+----------+----------+--------------------------+
|  1 | SIMPLE      | sql_queries | NULL       | index | sql_queries_sql_index,sql_queries_created_at_index,sql_queries_sql_created_at_elapsed_seconds | sql_queries_sql_created_at_elapsed_seconds | 780     | NULL | 29773986 |    50.00 | Using where; Using index |
+----+-------------+-------------+------------+-------+-----------------------------------------------------------------------------------------------+--------------------------------------------+---------+------+----------+----------+--------------------------+

Update 3:

As a sanity check I tried removing the date constraint and added an index on (sql, elapsed_seconds). The query was then instantaneous.

Best Answer

There are only three serious candidates:

(`created_at`,`sql`,`elapsed_seconds`) -- 1
(`created_at`,`elapsed_seconds`,`sql`) -- 2
(`sql`,`created_at`,`elapsed_seconds`) -- 3

Both are "covering". That is, the query can be handled entirely in the index. EXPLAIN indicates such by saying Using index.

Analysis:

(`created_at`,`sql`,`elapsed_seconds`) -- 1
(`created_at`,`elapsed_seconds`,`sql`) -- 2

filter first. But then the rest of the index is not in any useful order. So it sorts to do the GROUP BY and eventually finds the max. It cannot simply reach for the 'last' entry to get MAX. I don't think either of these is better than the other of the two.

(`sql`,`created_at`,`elapsed_seconds`) -- 3

might avoid the sort, since the sql values come one at a time. Also, the Optimizer might be able to jump to the starting point in the index for the desired created_at (for each sql). Again, it cannot simply reach for the 'last' entry to get MAX.

I vote for #3. However, this is an area where there have been optimization improvements. That is, an older version of MySQL may not do, for example, the leapfrogging.