Mysql – Optimization for aggregation and grouping with MySQL

aggregategroup byMySQLoptimization

I have an application that is pulling data on a daily basis from numerous APIs, storing it and then presenting it on the front end in various charts, tables, etc..

I'm having some issues avoiding Using temporary and Using filesort with some of the queries being run on a database.

Now, I know that sometimes this is unavoidable. Should I even bother changing it in this situation?

The application caches the requests that run the most of these queries, so they won't be run often at all. However for the sake of best practices and learning I want to make sure that I'm running things optimally.

Here's what's causing it. There's an assortment of tables containing various types of data. Assume a table structure where most of the columns are data that can be aggregated (SUM, AVG, etc) and any other columns are either ignored or grouped. One column consistent across the report data tables is a DATE column. As the data is pulled daily from APIs, the data in each table is being stored for said date.

The majority of the front end components are currently displaying data to the end users at a monthly detail level (daily is needed though for upcoming features — this level of detail cannot be compromised).

I'm currently using MySQL to group and aggregate data.

Here's a basic illustration:

SELECT
    DATE_FORMAT(c.date, '%Y-%m') as date,
    SUM(c.conversions) as conversions
FROM data_ga_conversions c
WHERE c.goal_id = 1 AND c.date BETWEEN '2014-10-01' AND '2015-02-28'
GROUP BY YEAR(c.date), MONTH(c.date)

Note the GROUP BY clause using date functions. As I'm sure you know — this is what's causing the query executor to use temporary and filesort. (Note: I'm aware that I can get rid of filesort by adding ORDER BY NULL, however I'm trying to keep these basic sorting and aggregation calculations in one place, application code or SQL.)

Am I approaching this the wrong way? Here's a couple of alternate methods I had considered:

  1. Storing year, month and day in separate columns.
  2. Aggregate tables (à la materialized views… switching to postgresql is a potential option too).
  3. Let the application do it

Thanks!

Best Answer

Using temporary and filesort are not as evil as everyone makes them out to be.

Do not split a DATE into year, month, and day columns; it will cause more pain than it is worth.

I prefer (in your specific case) to do GROUP BY LEFT(c.date, 7), although it may not speed up anything.

"Summary Tables" is the way to go for that type of query, assuming you have large tables. More discussion in my blog. You would add one (maybe more) row to the summary table each night, then a query against that table could summarize any day range, including your month.

Do not store AVG(...) in a Summary table; it is probably more 'correct' to store SUM(foo) AS sum_foo and COUNT(*) AS foo_ct, then have the report

SUM(sum_foo) / SUM(foo_ct) AS 'Average foo'

Probably the optimal summary table would have

PRIMARY KEY(goal_id, date)

and have sum_conversions calculated from SUM(c.conversions), plus anything else relevant. Often, one summary table can handle a small set of standard 'reports'.