Mysql – Improving performance for sum and group by with large datasets

group byMySQL

I have a query that looked like this:

SELECT sum(score) as score, count(distinct(user_id)) number, round(sum(score)/count(distinct(user_id))) score_per_member, teams.*, types.name as type 
FROM `teams` inner join `users` on `teams`.`id` = `users`.`team_id` 
inner join `results` on `results`.`user_id` = `users`.`id`
inner join `types` on `types`.`id` = `teams`.`type_id`
WHERE `results`.`year` = '2015'
GROUP BY `team_id`
ORDER BY `score` DESC;  

Users has 250k rows. Results has 4m rows with 1m with year 2015.

Although explain doesn't look too bad:

1   SIMPLE  teams   ALL PRIMARY,teams_type_id_foreign   NULL    NULL    NULL    4449    Using temporary; Using filesort
1   SIMPLE  types   eq_ref  PRIMARY PRIMARY 4   teams.type_id   1   
1   SIMPLE  users   ref PRIMARY,users_team_id_foreign   users_team_id_foreign   5   teams.id    57  Using where; Using index
1   SIMPLE  results ref user_id,results_year_user_id_index  results_year_user_id_index  6   const,users.id  6

It was taking 50 seconds to run.

I refactored it to:

SELECT score, number, round(score/number) score_per_member, teams.*, types.name as type FROM `teams`  
    inner join 
        (
            select sum(score) score, team_id, count(distinct(user_id)) number
            from `results` 
            join users on `results`.`user_id` = `users`.`id`
            WHERE `year` = '2015' and year_id is not null
            GROUP BY `team_id`
    ) r on teams.id = r.team_id
    inner join `types` on `types`.`id` = `teams`.`type_id` ORDER BY `score` DESC;

Explain actually 'looks' much worse:

1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    2045    Using filesort
1   PRIMARY teams   eq_ref  PRIMARY,teams_type_id_foreign   PRIMARY 4   r.team_id   1   
1   PRIMARY types   eq_ref  PRIMARY PRIMARY 4   teams.type_id   1   
2   DERIVED users   range   PRIMARY,users_team_id_foreign   users_team_id_foreign   5   NULL    132496  Using where; Using index
2   DERIVED results ref user_id,results_year_user_id_index  results_year_user_id_index  6   users.id    6   

But the performance is now much, much better (2 seconds).

How could I improve queries like this even more? Is it just a case of throwing more hardware at it?

Best Answer

The big issue is "inflation-deflation" when you do JOIN + GROUP BY.

Think of what JOIN does -- it generates an intermediate table with teams * types/team * users/type/team * results/... That is, that table is much bigger than any of the original tables.

Then it does the GROUP BY do deflate that back to one row per team.

Usually the combination gives COUNTs and SUMs that are bigger than expected. (Did you check?) And it takes a long time.

So, the trick is to use subqueries to avoid any single query with both JOIN and GROUP BY in it. (The combo is ok if the JOIN is known to be 1:1, which seems to be the case for teams and types, but not the other JOINs.)

There are two ways to do such subqueries:

SELECT
        ( SELECT SUM() FROM ... WHERE ... ) -- correlated subquery; single result only
    FROM ...

The subquery is effectively a GROUP BY without saying it, but it boils down to a single value, thereby avoiding the "inflation".

and

SELECT ...
    FROM ( SELECT COUNT(), SUM(...) FROM ... GROUP BY ...
         ) AS x  -- tmp table for some of the data; no JOIN
    JOIN ...

The subquery "deflates" some of the data before getting to the JOIN.

Caution: Using multiple subqueries of the second kind is slow because of lack of indexes:

SELECT ...
    FROM ( SELECT ... ) a
    JOIN ( SELECT ... ) b ON ...

5.6 helps somewhat because it discovers and adds an index.