There is one difference in the two execution plans that I think is relevant. Both queries use the `idx_stock_id_datetime (stock_id, datetime)` index, but if you look at `key_len`, the first query has 4 while the second has 12. This means that the first query uses only `stock_id` (the first part) of the index, while the second uses both parts.
The reason the second query can do that is that it doesn't need to do any aggregate calculation on the table: all the requested information is available from the index. (Wait, all of it? What about `p0.id`? That is also included in the index, as it's the clustered key of the table and is therefore included in every other non-clustered index.)
The first query, though, has a harder job to do. For every group it has to perform several aggregations that need the `open`, `low`, and `high` columns, which are not included in the index, so it has to scan a large part of the table - if not all of it - to find those values.
To clarify: what slows down the first query is not the aggregations themselves but the columns those aggregations need, which cannot be found in the index.
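The covering/non-covering distinction is easy to demonstrate outside MySQL too. Here is a minimal sketch using SQLite from Python (the schema is an assumption modeled on the question's `point` table); SQLite's `EXPLAIN QUERY PLAN` reports `COVERING INDEX` when a query can be answered from the index alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE point (
    id INTEGER PRIMARY KEY,
    stock_id INTEGER, datetime TEXT,
    open REAL, high REAL, low REAL, close REAL, volume INTEGER
);
CREATE INDEX idx_stock_id_datetime ON point (stock_id, datetime);
""")

# Only indexed columns requested: the planner never touches the table.
covered = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT stock_id, datetime FROM point WHERE stock_id = 1"
).fetchall()

# high/low are not in the index: the planner must fetch table rows too.
not_covered = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT MAX(high), MIN(low) FROM point WHERE stock_id = 1"
).fetchall()

print(covered[-1][-1])      # ... USING COVERING INDEX idx_stock_id_datetime ...
print(not_covered[-1][-1])  # ... USING INDEX ... (no "COVERING")
```

The same reasoning applies to MySQL's `EXPLAIN`, where a covered query shows `Using index` in the `Extra` column.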
Since MySQL has no materialized views or computed columns yet, if speeding up this query is crucial, you could recreate the `point` table, altering the indexes to:

```sql
PRIMARY KEY `idx_stock_id_datetime` (`stock_id`, `datetime`),
UNIQUE KEY (`id`),
```
effectively making `(stock_id, datetime)` the clustered index of the table. This means that all the values you need for a specific stock and a range of dates are stored consecutively in the clustered index - which, in fact, is the table. As this nice explanation of the InnoDB clustered index says, you could call InnoDB tables index-organized tables: Understanding InnoDB clustered indexes
This will, however, affect all other non-SELECT operations on the table. The clustered index will no longer be ever-increasing, so depending on how you use the table (e.g. how often you insert/delete/update), you may see performance degrade while InnoDB rearranges values or inserts into the middle of the table/index.
Another solution might be to create another table that serves as a substitute for a materialized view and holds the results you need:

```sql
stock_id,
YEARWEEK(datetime) AS label,
KG_FIRST(open, datetime) AS open,
MAX(high) AS high,
MIN(low) AS low,
KG_LAST(close, datetime) AS close,
SUM(volume) AS volume
```

and is updated with triggers when rows are inserted/updated/deleted in the `point` table.
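A sketch of that trigger-maintained summary table, using SQLite from Python so it can run anywhere. Several stand-ins are assumptions: `strftime('%Y%W', ...)` replaces MySQL's `YEARWEEK`, and the `KG_FIRST`/`KG_LAST` open/close columns are omitted because those are custom MySQL UDFs. The trigger simply recomputes the affected (stock, week) row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE point (
    id INTEGER PRIMARY KEY,
    stock_id INTEGER, datetime TEXT,
    open REAL, high REAL, low REAL, close REAL, volume INTEGER
);

-- Summary table standing in for the materialized view.
CREATE TABLE point_weekly (
    stock_id INTEGER,
    label    TEXT,      -- year+week, like MySQL's YEARWEEK()
    high     REAL,
    low      REAL,
    volume   INTEGER,
    PRIMARY KEY (stock_id, label)
);

-- On every insert, recompute the affected (stock, week) summary row.
CREATE TRIGGER point_ai AFTER INSERT ON point BEGIN
    INSERT OR REPLACE INTO point_weekly (stock_id, label, high, low, volume)
    SELECT NEW.stock_id, strftime('%Y%W', NEW.datetime),
           MAX(high), MIN(low), SUM(volume)
    FROM point
    WHERE stock_id = NEW.stock_id
      AND strftime('%Y%W', datetime) = strftime('%Y%W', NEW.datetime);
END;
""")

conn.executemany(
    "INSERT INTO point (stock_id, datetime, open, high, low, close, volume)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, "2023-01-02", 10.0, 12.0,  9.5, 11.0, 100),
     (1, "2023-01-03", 11.0, 13.0, 10.5, 12.5, 150)],
)

print(conn.execute("SELECT * FROM point_weekly").fetchall())
# [(1, '202301', 13.0, 9.5, 250)]
```

Matching `AFTER UPDATE` and `AFTER DELETE` triggers would recompute (or remove) the affected week the same way; a MySQL version would use `YEARWEEK(NEW.datetime)` and the `KG_FIRST`/`KG_LAST` UDFs in the recompute.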
This:

```sql
SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id
```

finds all users who have a timesheet submitted on or before the given date. It's equivalent to:
```sql
SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');
```
or:
```sql
SELECT users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets
    WHERE timesheets.user_id = users.id
      AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);
```
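As a quick check that the three forms return the same result, here is a sketch with SQLite from Python (the schema is a hypothetical reconstruction from the query; note that SQLite also accepts the bare `users.*` under `GROUP BY`, though by its own any-value rule rather than PostgreSQL's functional-dependency analysis):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE timesheets (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         submitted_at TEXT);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO timesheets VALUES
    (1, 1, '2010-07-01'),   -- alice, before the cutoff
    (2, 1, '2010-07-02'),   -- alice again: a plain join would duplicate her
    (3, 2, '2010-08-01');   -- bob, after the cutoff
""")

cutoff = "2010-07-06 15:27:05.117700"

grouped = conn.execute("""
    SELECT users.* FROM users
    INNER JOIN timesheets ON timesheets.user_id = users.id
    WHERE timesheets.submitted_at <= ?
    GROUP BY users.id
""", (cutoff,)).fetchall()

distinct = conn.execute("""
    SELECT DISTINCT users.* FROM users
    INNER JOIN timesheets ON timesheets.user_id = users.id
    WHERE timesheets.submitted_at <= ?
""", (cutoff,)).fetchall()

via_exists = conn.execute("""
    SELECT users.* FROM users
    WHERE EXISTS (
        SELECT 1 FROM timesheets
        WHERE timesheets.user_id = users.id
          AND timesheets.submitted_at <= ?
    )
""", (cutoff,)).fetchall()

print(grouped)  # [(1, 'alice')] from all three queries
```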
It works because `users.id` is the primary key, so all other fields of `users` are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row, because there can be only one candidate `users.name` (or whatever) for any given `users.id`.
(Older PostgreSQL versions didn't know how to identify functional dependencies on the primary key and would throw an `ERROR` about needing to use an aggregate or include the field in the `GROUP BY` here.)
Best Answer
Get the maximum `updated_at` for a given `id` from each table, combine the results into one row set, and then get the maximum from it. This is much more efficient than combining the tables first and then filtering and aggregating the combined set, even though the code would look simpler that way.
This method can also benefit from an index on `(id, updated_at)` in each table, while probably no index would help if you decided to combine the whole tables first.
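A sketch of the shape being described, with hypothetical tables `t1` and `t2` in SQLite via Python: take each table's per-`id` maximum first (a single index seek on `(id, updated_at)`), `UNION ALL` the one-row results, then take the maximum of those:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, updated_at TEXT);
CREATE TABLE t2 (id INTEGER, updated_at TEXT);
-- The per-table (id, updated_at) indexes let each inner MAX()
-- be answered without scanning the table.
CREATE INDEX i1 ON t1 (id, updated_at);
CREATE INDEX i2 ON t2 (id, updated_at);
INSERT INTO t1 VALUES (7, '2023-01-01'), (7, '2023-03-01');
INSERT INTO t2 VALUES (7, '2023-02-01');
""")

row = conn.execute("""
    SELECT MAX(updated_at) FROM (
        SELECT MAX(updated_at) AS updated_at FROM t1 WHERE id = ?
        UNION ALL
        SELECT MAX(updated_at) FROM t2 WHERE id = ?
    ) AS m
""", (7, 7)).fetchone()
print(row[0])  # 2023-03-01
```

The inner subqueries each produce at most one row, so the outer aggregate works over a tiny row set instead of the concatenation of the whole tables.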