Running this query
SELECT
s0.id,
s0.symbol,
YEARWEEK(p0.datetime) AS label,
p0.id AS point_id,
p0.datetime,
KG_FIRST(p0.open, p0.datetime) AS open,
MAX(p0.high) AS high,
MIN(p0.low) AS low,
KG_LAST(p0.close, p0.datetime) AS close,
SUM(p0.volume) AS volume
FROM stock s0
LEFT JOIN point p0 ON s0.id = p0.stock_id
WHERE (s0.symbol = 'AMD') AND (p0.datetime BETWEEN '2010-01-01 00:00:00' AND '2012-01-01 00:00:00')
GROUP BY YEARWEEK(p0.datetime)
ORDER BY p0.datetime ASC;
results in the following EXPLAIN
output:
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+-------+----------------------------------------------+
| 1 | SIMPLE | s0 | const | PRIMARY,UNIQ_4B365660ECC836F9 | UNIQ_4B365660ECC836F9 | 767 | const | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | p0 | ref | idx_stock_id_datetime,IDX_B7A5F324DCD6110 | idx_stock_id_datetime | 4 | const | 75829 | Using where |
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+-------+----------------------------------------------+
Now, let's make the same query but without any aggregate functions
SELECT
YEARWEEK(p0.datetime) AS label,
s0.id,
s0.symbol,
p0.id AS point_id,
p0.datetime
FROM stock s0
LEFT JOIN point p0 ON s0.id = p0.stock_id
WHERE s0.symbol = 'AMD' AND (p0.datetime BETWEEN '2010-01-01 00:00:00' AND '2012-01-01 00:00:00')
GROUP BY YEARWEEK(p0.datetime)
ORDER BY p0.datetime ASC;
This makes the query execute considerabely faster (about 20x). One would think it's because the aggregate functions caused the slowdown. Obviously, but EXPLAIN
also reveals that the rows are being fetched with a different strategy now:
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+--------+----------------------------------------------+
| 1 | SIMPLE | s0 | const | PRIMARY,UNIQ_4B365660ECC836F9 | UNIQ_4B365660ECC836F9 | 767 | const | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | p0 | range | idx_stock_id_datetime,IDX_B7A5F324DCD6110 | idx_stock_id_datetime | 12 | NULL | 391090 | Using where; Using index |
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+--------+----------------------------------------------+
The access type has changed from ref
to range
and the extra column says it's able to get the rows straight from index whereas before this was not the case.
Schema definitions
CREATE TABLE IF NOT EXISTS `point` (
`id` int(11) NOT NULL auto_increment,
`datetime` datetime NOT NULL,
`volume` int(11) NOT NULL,
`stock_id` int(3) NOT NULL,
`open` decimal(10,3) NOT NULL,
`high` decimal(10,3) NOT NULL,
`low` decimal(10,3) NOT NULL,
`close` decimal(10,3) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_stock_id_datetime` (`stock_id`,`datetime`),
KEY `IDX_B7A5F324DCD6110` (`stock_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `stock` (
`id` int(11) NOT NULL auto_increment,
`symbol` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_4B365660ECC836F9` (`symbol`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `point`
ADD CONSTRAINT `FK_B7A5F324DCD6110` FOREIGN KEY (`stock_id`) REFERENCES `stock` (`id`) ON DELETE CASCADE;
Why is there a 20x speed difference between using and not using aggregate functions and how should I optimize the query to make it perform better?
Best Answer
There is one difference in the two execution plans that I think is relevant. Both queries use the
idx_stock_id_datetime (stock_id, datetime)
index but if you notice thekey_len
, the first query has4
while the second has12
.This means that the first query uses only the
stock_id
(first part) of the index, while the second uses both parts.The reason that the second query can do that, is that it doesn't any need to do any aggregated calculation on the table. All the requested information is available from the index. (Wait, all, what about
p0.id
? This is also included in the index, as it's the clustered key of the table so it is included in all other non-clustered inices).The first query though has a harder job to do. For every group it has to do several aggregations which require
open
,low
,high
columns which are not included in the index, so it has to scan a large part of the table - if not all of it - to find those values.To clarify, what slows the first query is not the aggregations but the columns needed for these aggregations that cannot be found on the index.
Since MySQL has no materialized views or computed columns yet, and if speeding this query is crucial, you could recreate the
point
table, altering the indices to:effectively making the
(stock_id, datetime)
the clustered index of the table. This will mean that all the values you need for a specific stock and a range of dates are consecutively stored in the clustered index - which in fact is the table. As this nice explanation of InnoDB clustered index says, you could call InnoDB tables index-organized tables: Understanding InnoDB clustered indexesThis will affect however all other non-SELECT operations on the table. The clustered index will no longer be ever-increasing, so depending on how you use the table, e.g. how often you are inserting/deleting/updating, you may see slowing down of the performance, when InnoDB is rearranging the values or inserting in the middle of the table/index.
Another solution might be to create another table that is a substitute of a materialized view and holds the results you need:
and is updated with triggers when rows are inserted/updated/deleted in the
point
table.