I'm trying to optimize a query which uses a combination of GROUP BY and ORDER BY which together cause a big performance hit (~x15). If I remove either of the conditions, the query runs pretty fast. I tried different indexing strategies and query structures, but have come up empty so far.
The original query looks like this:
SELECT `academies_calendar`.`day`, `academies_calendar`.`start`, `academies_calendar`.`duration`,
`academies_calendar`.`title`, `academies_calendar`.`id`, TIME_TO_SEC(TIMEDIFF('00:16:00', academies_calendar.start)) AS `diff`,
IF('00:16' BETWEEN academies_calendar.start - INTERVAL 60 MINUTE AND academies_calendar.start + INTERVAL (academies_calendar.duration + 60) MINUTE,1,0) AS `active`,
IF('00:16' BETWEEN academies_calendar.start - INTERVAL 60 MINUTE AND academies_calendar.start + INTERVAL (academies_calendar.duration) MINUTE,1,0) AS `starting`,
`academies_calendar`.`instructor_id`, CONCAT_WS(' ',academies_instructors.first_name,academies_instructors.last_name) AS `instructor`,
COUNT(members_attendance.id) AS `attending`, GROUP_CONCAT( CONCAT_WS('>|<',members.first_name,members.last_name,members.academy_id,members.photo) SEPARATOR '<^>') AS `members`
FROM `academies_calendar`
LEFT JOIN `academies_instructors` ON academies_instructors.id=academies_calendar.instructor_id
LEFT JOIN `members_attendance` ON members_attendance.event_id=academies_calendar.id AND (DATE(members_attendance.checked_in)=CURDATE())
LEFT JOIN `members` ON members.id=members_attendance.member_id
WHERE (academies_calendar.academy_id=3)
AND academies_calendar.day=2
AND academies_calendar.start > CAST('00:16' as time) - INTERVAL (academies_calendar.duration + 60) MINUTE
GROUP BY `academies_calendar`.`id`
ORDER BY ISNULL(academies_calendar.start),academies_calendar.start ASC
It's pulling a bit of data and organizing it, but due to the filters that hit indexes, it's usually working on small set of data – 1-6 entries in the academies_calendar
table.
The EXPLAIN
results for that query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE academies_calendar ref PRIMARY,academy_id academy_id 6 const,const 5 Using where; Using temporary; Using filesort
1 SIMPLE academies_instructors eq_ref PRIMARY PRIMARY 4 maonrails.academies_calendar.instructor_id 1 NULL
1 SIMPLE members_attendance ref event_id event_id 5 maonrails.academies_calendar.id 5 Using where
1 SIMPLE members eq_ref PRIMARY PRIMARY 4 maonrails.members_attendance.member_id 1 Using where
The key used for the main table (academy_id
) is a composite key over (academy_id,day,start). The rest are hitting indexes or primary keys. Only 5 rows are found for this specific query, and it complete in 0.015s
If I remove either the ORDER BY clause, or the GROUP_CONCAT() call, the query runs at around 0.0009s.
I'm not sure I understand this behavior – isn't the ORDER BY clause supposed to be applied after the WHERE clauses and GROUP clauses have been applied? why does sorting 5 rows increase query time by more than an order of magnitude? I get the same results if the query returns just 1 result, by the way.
What seems to be happening, is that the grouping multiplies the amount of time it needs to sort the result set by the amount of rows joined to it from the other tables. I tried to resolve that by replacing those joins with a subquery that groups the data and returns one result per row, but it did not have a significant impact on query run time.
Any ideas on how to optimize this query would be much appreciated!
Best Answer
LEFT
if you don't need it.ORDER BY
hasISNULL(start)
, yet theWHERE
clause will fail ifstart is
NULL`. Seems like something could be simplified.PRIMARY KEY
. (I'll need to seeSHOW CREATE TABLE
for this discussion.)It needs to look at all index rows with
academy_id=3 and day=2
, then filter onstart
. That is currently about 5 rows; will that increase when there is more data? Will you end up with aLIMIT
?Will the test on
start
be a significant factor in the filtering? If so we can discuss a 'lazy eval' technique.Here's one thing that may help: Remove the
LEFT JOIN members
and replace theGROUP_CONCAT...
with:Also, change
to
and add
(unless that table has
PRIMARY KEY(event_id)
-- it would help if you showed usSHOW CREATE TABLE
)Explanation
Some of the question deals with the slowdown due to
ORDER BY
. Let me walk you through some of the steps:WHERE
JOINed
tables, again filtering where appropriateGROUP BY id
-- Since the joins created an explosion of rows in a tmp table, but you wanted only one row per id, this shrinks things back down. It may require a temp table and a sort. (Note: My suggestion onGROUP_CONCAT
avoids some of the explosion.)ORDER BY something_else
-- If theORDER BY
had been on the same value(s), it could be subsumed into theGROUP BY
. But, since it is not, there needs to be another temp table and sort.Caveat: "temp table" may or may not be entirely in RAM. "sort" may or may not be in RAM. If either of these hits disk, the query will be slower.
Caveat: If any of the
SELECTed
columns areTEXT
orBLOB
, the temp table will be headed for disk. (The whole story is more complex; let's seeSHOW CREATE TABLE
.)It may be possible to get rid of all
JOINs
, such as my suggestion above. If so, then you can get rid of theGROUP BY
, thereby speeding things up.Do what you can in the directions I suggest, then let's make another pass at it. (There may need to be further index changes.)