I have a query that looked like this:
SELECT sum(score) as score, count(distinct(user_id)) number, round(sum(score)/count(distinct(user_id))) score_per_member, teams.*, types.name as type
FROM `teams` inner join `users` on `teams`.`id` = `users`.`team_id`
inner join `results` on `results`.`user_id` = `users`.`id`
inner join `types` on `types`.`id` = `teams`.`type_id`
WHERE `results`.`year` = '2015'
GROUP BY `team_id`
ORDER BY `score` DESC;
Users has 250k rows. Results has 4m rows with 1m with year 2015.
Although explain doesn't look too bad:
1 SIMPLE teams ALL PRIMARY,teams_type_id_foreign NULL NULL NULL 4449 Using temporary; Using filesort
1 SIMPLE types eq_ref PRIMARY PRIMARY 4 teams.type_id 1
1 SIMPLE users ref PRIMARY,users_team_id_foreign users_team_id_foreign 5 teams.id 57 Using where; Using index
1 SIMPLE results ref user_id,results_year_user_id_index results_year_user_id_index 6 const,users.id 6
It was taking 50 seconds to run.
I refactored it to:
SELECT score, number, round(score/number) score_per_member, teams.*, types.name as type FROM `teams`
inner join
(
select sum(score) score, team_id, count(distinct(user_id)) number
from `results`
join users on `results`.`user_id` = `users`.`id`
WHERE `year` = '2015' and year_id is not null
GROUP BY `team_id`
) r on teams.id = r.team_id
inner join `types` on `types`.`id` = `teams`.`type_id` ORDER BY `score` DESC;
Explain actually 'looks' much worse:
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2045 Using filesort
1 PRIMARY teams eq_ref PRIMARY,teams_type_id_foreign PRIMARY 4 r.team_id 1
1 PRIMARY types eq_ref PRIMARY PRIMARY 4 teams.type_id 1
2 DERIVED users range PRIMARY,users_team_id_foreign users_team_id_foreign 5 NULL 132496 Using where; Using index
2 DERIVED results ref user_id,results_year_user_id_index results_year_user_id_index 6 users.id 6
But the performance is now much, much better (2 seconds).
How could I improve queries like this even more? Is it just a case of throwing more hardware at it?
Best Answer
The big issue is "inflation-deflation" when you do
JOIN
+GROUP BY
.Think of what
JOIN
does -- it generates an intermediate table with teams * types/team * users/type/team * results/... That is, that table is much bigger than any of the original tables.Then it does the
GROUP BY
do deflate that back to one row per team.Usually the combination gives
COUNTs
andSUMs
that are bigger than expected. (Did you check?) And it takes a long time.So, the trick is to use subqueries to avoid any single query with both
JOIN
andGROUP BY
in it. (The combo is ok if theJOIN
is known to be 1:1, which seems to be the case forteams
andtypes
, but not the otherJOINs
.)There are two ways to do such subqueries:
The subquery is effectively a
GROUP BY
without saying it, but it boils down to a single value, thereby avoiding the "inflation".and
The subquery "deflates" some of the data before getting to the
JOIN
.Caution: Using multiple subqueries of the second kind is slow because of lack of indexes:
5.6 helps somewhat because it discovers and adds an index.