I have an application that is pulling data on a daily basis from numerous APIs, storing it and then presenting it on the front end in various charts, tables, etc..
I'm having some issues avoiding Using temporary
and Using filesort
with some of the queries being run on a database.
Now, I know that sometimes this is unavoidable. Should I even bother changing it in this situation?
The application caches the requests that run the most of these queries, so they won't be run often at all. However for the sake of best practices and learning I want to make sure that I'm running things optimally.
Here's what's causing it. There's an assortment of tables containing various types of data. Assume a table structure where most of the columns are data that can be aggregated (SUM, AVG, etc) and any other columns are either ignored or grouped. One column consistent across the report data tables is a DATE
column. As the data is pulled daily from APIs, the data in each table is being stored for said date.
The majority of the front end components are currently displaying data to the end users at a monthly detail level (daily is needed though for upcoming features — this level of detail cannot be compromised).
I'm currently using MySQL to group and aggregate data.
Here's a basic illustration:
SELECT
DATE_FORMAT(c.date, '%Y-%m') as date,
SUM(c.conversions) as conversions
FROM data_ga_conversions c
WHERE c.goal_id = 1 AND c.date BETWEEN '2014-10-01' AND '2015-02-28'
GROUP BY YEAR(c.date), MONTH(c.date)
Note the GROUP BY
clause using date functions. As I'm sure you know — this is what's causing the query executor to use temporary and filesort. (Note: I'm aware that I can get rid of filesort by adding ORDER BY NULL
, however I'm trying to keep these basic sorting and aggregation calculations in one place, application code or SQL.)
Am I approaching this the wrong way? Here's a couple of alternate methods I had considered:
- Storing year, month and day in separate columns.
- Aggregate tables (à la materialized views… switching to postgresql is a potential option too).
- Let the application do it
Thanks!
Best Answer
Using temporary and filesort are not as evil as everyone makes them out to be.
Do not split a
DATE
into year, month, and day columns; it will cause more pain than it is worth.I prefer (in your specific case) to do
GROUP BY LEFT(c.date, 7)
, although it may not speed up anything."Summary Tables" is the way to go for that type of query, assuming you have large tables. More discussion in my blog. You would add one (maybe more) row to the summary table each night, then a query against that table could summarize any day range, including your month.
Do not store
AVG(...)
in a Summary table; it is probably more 'correct' to storeSUM(foo) AS sum_foo
andCOUNT(*) AS foo_ct
, then have the reportProbably the optimal summary table would have
and have
sum_conversions
calculated fromSUM(c.conversions)
, plus anything else relevant. Often, one summary table can handle a small set of standard 'reports'.