There is one difference in the two execution plans that I think is relevant. Both queries use the `idx_stock_id_datetime (stock_id, datetime)` index, but if you look at `key_len`, the first query has 4 while the second has 12. This means that the first query uses only `stock_id` (the first part) of the index, while the second uses both parts.
The reason the second query can do that is that it doesn't need to do any aggregate calculation on the table: all the requested information is available from the index. (Wait, all of it? What about `p0.id`? That is also included in the index, as it's the clustered key of the table and is therefore included in every other non-clustered index.)
The first query, though, has a harder job to do. For every group it has to perform several aggregations that need the `open`, `low`, and `high` columns, which are not included in the index, so it has to scan a large part of the table - if not all of it - to find those values.
To clarify: what slows down the first query is not the aggregations themselves but the columns those aggregations need, which cannot be found in the index.
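The covering/non-covering distinction is easy to demonstrate outside MySQL too. Here is a minimal sketch using SQLite from Python (the schema is an assumption modeled on the question's `point` table); SQLite's `EXPLAIN QUERY PLAN` reports `COVERING INDEX` when a query can be answered from the index alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE point (
    id INTEGER PRIMARY KEY,
    stock_id INTEGER, datetime TEXT,
    open REAL, high REAL, low REAL, close REAL, volume INTEGER
);
CREATE INDEX idx_stock_id_datetime ON point (stock_id, datetime);
""")

# Only indexed columns requested: the planner never touches the table.
covered = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT stock_id, datetime FROM point WHERE stock_id = 1"
).fetchall()

# high/low are not in the index: the planner must fetch table rows too.
not_covered = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT MAX(high), MIN(low) FROM point WHERE stock_id = 1"
).fetchall()

print(covered[-1][-1])      # ... USING COVERING INDEX idx_stock_id_datetime ...
print(not_covered[-1][-1])  # ... USING INDEX ... (no "COVERING")
```

The same reasoning applies to MySQL's `EXPLAIN`, where a covered query shows `Using index` in the `Extra` column.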
Since MySQL has no materialized views or computed columns yet, if speeding up this query is crucial, you could recreate the `point` table, altering the indexes to:

```sql
PRIMARY KEY `idx_stock_id_datetime` (`stock_id`, `datetime`),
UNIQUE KEY (`id`),
```
effectively making `(stock_id, datetime)` the clustered index of the table. This means that all the values you need for a specific stock and a range of dates are stored consecutively in the clustered index - which, in fact, is the table. As this nice explanation of the InnoDB clustered index says, you could call InnoDB tables index-organized tables: Understanding InnoDB clustered indexes
This will, however, affect all other non-SELECT operations on the table. The clustered index will no longer be ever-increasing, so depending on how you use the table (e.g. how often you insert/delete/update), you may see performance degrade while InnoDB rearranges values or inserts into the middle of the table/index.
Another solution might be to create another table that serves as a substitute for a materialized view and holds the results you need:

```sql
stock_id,
YEARWEEK(datetime) AS label,
KG_FIRST(open, datetime) AS open,
MAX(high) AS high,
MIN(low) AS low,
KG_LAST(close, datetime) AS close,
SUM(volume) AS volume
```

and is updated with triggers when rows are inserted/updated/deleted in the `point` table.
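A sketch of that trigger-maintained summary table, using SQLite from Python so it can run anywhere. Several stand-ins are assumptions: `strftime('%Y%W', ...)` replaces MySQL's `YEARWEEK`, and the `KG_FIRST`/`KG_LAST` open/close columns are omitted because those are custom MySQL UDFs. The trigger simply recomputes the affected (stock, week) row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE point (
    id INTEGER PRIMARY KEY,
    stock_id INTEGER, datetime TEXT,
    open REAL, high REAL, low REAL, close REAL, volume INTEGER
);

-- Summary table standing in for the materialized view.
CREATE TABLE point_weekly (
    stock_id INTEGER,
    label    TEXT,      -- year+week, like MySQL's YEARWEEK()
    high     REAL,
    low      REAL,
    volume   INTEGER,
    PRIMARY KEY (stock_id, label)
);

-- On every insert, recompute the affected (stock, week) summary row.
CREATE TRIGGER point_ai AFTER INSERT ON point BEGIN
    INSERT OR REPLACE INTO point_weekly (stock_id, label, high, low, volume)
    SELECT NEW.stock_id, strftime('%Y%W', NEW.datetime),
           MAX(high), MIN(low), SUM(volume)
    FROM point
    WHERE stock_id = NEW.stock_id
      AND strftime('%Y%W', datetime) = strftime('%Y%W', NEW.datetime);
END;
""")

conn.executemany(
    "INSERT INTO point (stock_id, datetime, open, high, low, close, volume)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, "2023-01-02", 10.0, 12.0,  9.5, 11.0, 100),
     (1, "2023-01-03", 11.0, 13.0, 10.5, 12.5, 150)],
)

print(conn.execute("SELECT * FROM point_weekly").fetchall())
# [(1, '202301', 13.0, 9.5, 250)]
```

Matching `AFTER UPDATE` and `AFTER DELETE` triggers would recompute (or remove) the affected week the same way; a MySQL version would use `YEARWEEK(NEW.datetime)` and the `KG_FIRST`/`KG_LAST` UDFs in the recompute.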
This:

```sql
SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id
```

finds all users who have a timesheet submitted on or before the given date. It's equivalent to:
```sql
SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');
```
or:
```sql
SELECT users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets
    WHERE timesheets.user_id = users.id
      AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);
```
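As a quick check that the three forms return the same result, here is a sketch with SQLite from Python (the schema is a hypothetical reconstruction from the query; note that SQLite also accepts the bare `users.*` under `GROUP BY`, though by its own any-value rule rather than PostgreSQL's functional-dependency analysis):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE timesheets (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         submitted_at TEXT);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO timesheets VALUES
    (1, 1, '2010-07-01'),   -- alice, before the cutoff
    (2, 1, '2010-07-02'),   -- alice again: a plain join would duplicate her
    (3, 2, '2010-08-01');   -- bob, after the cutoff
""")

cutoff = "2010-07-06 15:27:05.117700"

grouped = conn.execute("""
    SELECT users.* FROM users
    INNER JOIN timesheets ON timesheets.user_id = users.id
    WHERE timesheets.submitted_at <= ?
    GROUP BY users.id
""", (cutoff,)).fetchall()

distinct = conn.execute("""
    SELECT DISTINCT users.* FROM users
    INNER JOIN timesheets ON timesheets.user_id = users.id
    WHERE timesheets.submitted_at <= ?
""", (cutoff,)).fetchall()

via_exists = conn.execute("""
    SELECT users.* FROM users
    WHERE EXISTS (
        SELECT 1 FROM timesheets
        WHERE timesheets.user_id = users.id
          AND timesheets.submitted_at <= ?
    )
""", (cutoff,)).fetchall()

print(grouped)  # [(1, 'alice')] from all three queries
```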
It works because `users.id` is the primary key, so all other fields of `users` are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row, because there can be only one candidate `users.name` (or whatever) for any given `users.id`.
(Older PostgreSQL versions didn't know how to identify functional dependencies on the primary key and would throw an `ERROR` about needing to use an aggregate or include the field in the `GROUP BY` here.)
Best Answer
Get the maximum `updated_at` for a given `id` from each table, combine the results into one row set, and then get the maximum from it. This is much more efficient than combining the tables first and then filtering and aggregating the combined set, even though the code would look simpler that way.
This method can also benefit from an index on `(id, updated_at)` in each table, while probably no index would help if you decided to combine the whole tables first.
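A sketch of the shape being described, with hypothetical tables `t1` and `t2` in SQLite via Python: take each table's per-`id` maximum first (a single index seek on `(id, updated_at)`), `UNION ALL` the one-row results, then take the maximum of those:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, updated_at TEXT);
CREATE TABLE t2 (id INTEGER, updated_at TEXT);
-- The per-table (id, updated_at) indexes let each inner MAX()
-- be answered without scanning the table.
CREATE INDEX i1 ON t1 (id, updated_at);
CREATE INDEX i2 ON t2 (id, updated_at);
INSERT INTO t1 VALUES (7, '2023-01-01'), (7, '2023-03-01');
INSERT INTO t2 VALUES (7, '2023-02-01');
""")

row = conn.execute("""
    SELECT MAX(updated_at) FROM (
        SELECT MAX(updated_at) AS updated_at FROM t1 WHERE id = ?
        UNION ALL
        SELECT MAX(updated_at) FROM t2 WHERE id = ?
    ) AS m
""", (7, 7)).fetchone()
print(row[0])  # 2023-03-01
```

The inner subqueries each produce at most one row, so the outer aggregate works over a tiny row set instead of the concatenation of the whole tables.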