Mysql – Why do aggregate functions alter the way the rows are fetched

MySQLoptimization

Running this query

SELECT
    s0.id,
    s0.symbol,
    YEARWEEK(p0.datetime) AS label,
    p0.id AS point_id,
    p0.datetime,
    KG_FIRST(p0.open, p0.datetime) AS open,
    MAX(p0.high) AS high,
    MIN(p0.low) AS low,
    KG_LAST(p0.close, p0.datetime) AS close,
    SUM(p0.volume) AS volume
FROM stock s0
LEFT JOIN point p0 ON s0.id = p0.stock_id
WHERE (s0.symbol = 'AMD') AND (p0.datetime BETWEEN '2010-01-01 00:00:00' AND '2012-01-01 00:00:00')
GROUP BY YEARWEEK(p0.datetime)
ORDER BY p0.datetime ASC;

results in the following EXPLAIN output:

+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type  | possible_keys                             | key                   | key_len | ref   | rows  | Extra                                        |
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+-------+----------------------------------------------+
|  1 | SIMPLE      | s0    | const | PRIMARY,UNIQ_4B365660ECC836F9             | UNIQ_4B365660ECC836F9 | 767     | const |     1 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | p0    | ref   | idx_stock_id_datetime,IDX_B7A5F324DCD6110 | idx_stock_id_datetime | 4       | const | 75829 | Using where                                  | 
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+-------+----------------------------------------------+

Now, let's make the same query but without any aggregate functions

SELECT
    YEARWEEK(p0.datetime) AS label,
    s0.id,
    s0.symbol,
    p0.id AS point_id,
    p0.datetime
FROM stock s0
LEFT JOIN point p0 ON s0.id = p0.stock_id
WHERE s0.symbol = 'AMD' AND (p0.datetime BETWEEN '2010-01-01 00:00:00' AND '2012-01-01 00:00:00')
GROUP BY YEARWEEK(p0.datetime)
ORDER BY p0.datetime ASC;

This makes the query execute considerabely faster (about 20x). One would think it's because the aggregate functions caused the slowdown. Obviously, but EXPLAIN also reveals that the rows are being fetched with a different strategy now:

+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+--------+----------------------------------------------+
| id | select_type | table | type  | possible_keys                             | key                   | key_len | ref   | rows   | Extra                                        |
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+--------+----------------------------------------------+
|  1 | SIMPLE      | s0    | const | PRIMARY,UNIQ_4B365660ECC836F9             | UNIQ_4B365660ECC836F9 | 767     | const |      1 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | p0    | range | idx_stock_id_datetime,IDX_B7A5F324DCD6110 | idx_stock_id_datetime | 12      | NULL  | 391090 | Using where; Using index                     | 
+----+-------------+-------+-------+-------------------------------------------+-----------------------+---------+-------+--------+----------------------------------------------+

The access type has changed from ref to range and the extra column says it's able to get the rows straight from index whereas before this was not the case.

Schema definitions

CREATE TABLE IF NOT EXISTS `point` (
  `id` int(11) NOT NULL auto_increment,
  `datetime` datetime NOT NULL,
  `volume` int(11) NOT NULL,
  `stock_id` int(3) NOT NULL,
  `open` decimal(10,3) NOT NULL,
  `high` decimal(10,3) NOT NULL,
  `low` decimal(10,3) NOT NULL,
  `close` decimal(10,3) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `idx_stock_id_datetime` (`stock_id`,`datetime`),
  KEY `IDX_B7A5F324DCD6110` (`stock_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `stock` (
  `id` int(11) NOT NULL auto_increment,
  `symbol` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `UNIQ_4B365660ECC836F9` (`symbol`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

ALTER TABLE `point`
  ADD CONSTRAINT `FK_B7A5F324DCD6110` FOREIGN KEY (`stock_id`) REFERENCES `stock` (`id`) ON DELETE CASCADE;

Why is there a 20x speed difference between using and not using aggregate functions and how should I optimize the query to make it perform better?

Best Answer

There is one difference in the two execution plans that I think is relevant. Both queries use the idx_stock_id_datetime (stock_id, datetime) index but if you notice the key_len, the first query has 4 while the second has 12.

This means that the first query uses only the stock_id (first part) of the index, while the second uses both parts.

The reason that the second query can do that, is that it doesn't any need to do any aggregated calculation on the table. All the requested information is available from the index. (Wait, all, what about p0.id? This is also included in the index, as it's the clustered key of the table so it is included in all other non-clustered inices).

The first query though has a harder job to do. For every group it has to do several aggregations which require open, low, high columns which are not included in the index, so it has to scan a large part of the table - if not all of it - to find those values.

To clarify, what slows the first query is not the aggregations but the columns needed for these aggregations that cannot be found on the index.


Since MySQL has no materialized views or computed columns yet, and if speeding this query is crucial, you could recreate the point table, altering the indices to:

PRIMARY KEY `idx_stock_id_datetime` (`stock_id`,`datetime`),
  UNIQUE KEY  (`id`),

effectively making the (stock_id, datetime) the clustered index of the table. This will mean that all the values you need for a specific stock and a range of dates are consecutively stored in the clustered index - which in fact is the table. As this nice explanation of InnoDB clustered index says, you could call InnoDB tables index-organized tables: Understanding InnoDB clustered indexes

This will affect however all other non-SELECT operations on the table. The clustered index will no longer be ever-increasing, so depending on how you use the table, e.g. how often you are inserting/deleting/updating, you may see slowing down of the performance, when InnoDB is rearranging the values or inserting in the middle of the table/index.


Another solution might be to create another table that is a substitute of a materialized view and holds the results you need:

stock_id,
YEARWEEK(datetime) AS label,
KG_FIRST(open, datetime) AS open,
MAX(high) AS high,
MIN(low) AS low,
KG_LAST(close, datetime) AS close,
SUM(volume) AS volume

and is updated with triggers when rows are inserted/updated/deleted in the point table.