Mysql – Order by reduces performance of a grouped result set

group byMySQLorder-byperformance

I'm trying to optimize a query which uses a combination of GROUP BY and ORDER BY which together cause a big performance hit (~x15). If I remove either of the conditions, the query runs pretty fast. I tried different indexing strategies and query structures, but have come up empty so far.

The original query looks like this:

SELECT `academies_calendar`.`day`, `academies_calendar`.`start`, `academies_calendar`.`duration`,
 `academies_calendar`.`title`, `academies_calendar`.`id`, TIME_TO_SEC(TIMEDIFF('00:16:00', academies_calendar.start)) AS `diff`, 
 IF('00:16' BETWEEN academies_calendar.start - INTERVAL 60 MINUTE AND academies_calendar.start + INTERVAL (academies_calendar.duration + 60) MINUTE,1,0) AS `active`, 
 IF('00:16' BETWEEN academies_calendar.start - INTERVAL 60 MINUTE AND academies_calendar.start + INTERVAL (academies_calendar.duration) MINUTE,1,0) AS `starting`, 
 `academies_calendar`.`instructor_id`, CONCAT_WS(' ',academies_instructors.first_name,academies_instructors.last_name) AS `instructor`,
 COUNT(members_attendance.id) AS `attending`, GROUP_CONCAT( CONCAT_WS('>|<',members.first_name,members.last_name,members.academy_id,members.photo) SEPARATOR '<^>') AS `members` 
FROM `academies_calendar` 
LEFT JOIN `academies_instructors` ON academies_instructors.id=academies_calendar.instructor_id 
LEFT JOIN `members_attendance` ON members_attendance.event_id=academies_calendar.id AND (DATE(members_attendance.checked_in)=CURDATE()) 
LEFT JOIN `members` ON members.id=members_attendance.member_id 
WHERE (academies_calendar.academy_id=3) 
AND academies_calendar.day=2 
AND academies_calendar.start > CAST('00:16' as time) - INTERVAL (academies_calendar.duration + 60) MINUTE
GROUP BY `academies_calendar`.`id` 
ORDER BY ISNULL(academies_calendar.start),academies_calendar.start ASC

It's pulling a bit of data and organizing it, but due to the filters that hit indexes, it's usually working on small set of data – 1-6 entries in the academies_calendar table.

The EXPLAIN results for that query:

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra   
1   SIMPLE  academies_calendar  ref     PRIMARY,academy_id  academy_id  6   const,const     5   Using where; Using temporary; Using filesort
1   SIMPLE  academies_instructors   eq_ref  PRIMARY     PRIMARY     4   maonrails.academies_calendar.instructor_id  1   NULL
1   SIMPLE  members_attendance  ref     event_id    event_id    5   maonrails.academies_calendar.id     5   Using where
1   SIMPLE  members     eq_ref  PRIMARY     PRIMARY     4   maonrails.members_attendance.member_id  1   Using where

The key used for the main table (academy_id) is a composite key over (academy_id,day,start). The rest are hitting indexes or primary keys. Only 5 rows are found for this specific query, and it complete in 0.015s

If I remove either the ORDER BY clause, or the GROUP_CONCAT() call, the query runs at around 0.0009s.

I'm not sure I understand this behavior – isn't the ORDER BY clause supposed to be applied after the WHERE clauses and GROUP clauses have been applied? why does sorting 5 rows increase query time by more than an order of magnitude? I get the same results if the query returns just 1 result, by the way.

What seems to be happening, is that the grouping multiplies the amount of time it needs to sort the result set by the amount of rows joined to it from the other tables. I tried to resolve that by replacing those joins with a subquery that groups the data and returns one result per row, but it did not have a significant impact on query run time.

Any ideas on how to optimize this query would be much appreciated!

Best Answer

  • Remove LEFT if you don't need it.
  • The ORDER BY has ISNULL(start), yet the WHERE clause will fail if start isNULL`. Seems like something could be simplified.
  • Will the table(s) become so big that they cannot be cached in RAM? If so, we can talk about the choice of PRIMARY KEY. (I'll need to see SHOW CREATE TABLE for this discussion.)
  • The query, as it stands, is destined to take some time...

It needs to look at all index rows with academy_id=3 and day=2, then filter on start. That is currently about 5 rows; will that increase when there is more data? Will you end up with a LIMIT?

Will the test on start be a significant factor in the filtering? If so we can discuss a 'lazy eval' technique.

Here's one thing that may help: Remove the LEFT JOIN members and replace the GROUP_CONCAT... with:

( SELECT GROUP_CONCAT(...) FROM members AS m
      WHERE m.id = members_attendance.member_id ) AS members,

Also, change

(DATE(members_attendance.checked_in)=CURDATE())

to

members_attendance.checked_in >= CURDATE() AND
members_attendance.checked_in  < CURDATE() + INTERVAL 1 DAY

and add

INDEX(event_id, checked_in)  -- in that order

(unless that table has PRIMARY KEY(event_id) -- it would help if you showed us SHOW CREATE TABLE)

Explanation

Some of the question deals with the slowdown due to ORDER BY. Let me walk you through some of the steps:

  1. fetch things from the first table, filtering based on the WHERE
  2. fetch things from JOINed tables, again filtering where appropriate
  3. GROUP BY id -- Since the joins created an explosion of rows in a tmp table, but you wanted only one row per id, this shrinks things back down. It may require a temp table and a sort. (Note: My suggestion on GROUP_CONCAT avoids some of the explosion.)
  4. ORDER BY something_else -- If the ORDER BY had been on the same value(s), it could be subsumed into the GROUP BY. But, since it is not, there needs to be another temp table and sort.

Caveat: "temp table" may or may not be entirely in RAM. "sort" may or may not be in RAM. If either of these hits disk, the query will be slower.

Caveat: If any of the SELECTed columns are TEXT or BLOB, the temp table will be headed for disk. (The whole story is more complex; let's see SHOW CREATE TABLE.)

It may be possible to get rid of all JOINs, such as my suggestion above. If so, then you can get rid of the GROUP BY, thereby speeding things up.

Do what you can in the directions I suggest, then let's make another pass at it. (There may need to be further index changes.)

<my-rant>
    3rd party software tries to help with DB abstraction,
    but that usually gets in the way of performance.
</my-rant>