MySQL – SUM() ignores GROUP BY and sums up 4 rows instead of 2

group by, MySQL, sum

I'm having difficulty with GROUP BY in MySQL.

My database setup:

client_visit
    - id
    - member_id
    - schedule_event_id
    - status_type_id (type_of_visit table)
    - visit_starts_at
    - visit_ends_at

member
    - id

schedule_event
    - id
    - member_id
    - starts_at
    - ends_at

type_of_visit
    - id
    - type (TYPE_BOOKED, TYPE_PRESENT etc)

For the purpose of this question: a member teaches a class or leads an activity (a schedule_event) at a given time. A client signs up for this class or activity.

For example:

Clients A, B and C book visits, which are stored in the client_visit table. Each row holds a schedule_event_id and a member_id, so we know which class was booked and which member is teaching it or leading the activity.

Now, we want to know the total time a given member spent teaching or leading events that clients signed up for (that is, events with a client_visit whose type_of_visit is "Booked" or "Present"). We'll take member ID 82 as our test case.

Member ID 82 had 4 clients in two different classes, so if each class took 2h 15 minutes (8100 seconds), that means total time should be 16200 seconds.

Here's my query first:

SELECT cv.member_id AS `member_id`,
       sch.id AS `scheduleId`,
       cv.visit_starts_at AS `visitStartsAt`,
       TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at) AS `totalTime`
  FROM `schedule_event` AS `sch`
         LEFT JOIN `client_visit` AS `cv` ON cv.schedule_event_id = sch.id
         INNER JOIN `type_of_visit` AS `tov` ON tov.id = cv.status_type_id
 WHERE (tov.type = 'TYPE_BOOKED' OR tov.type = 'TYPE_PRESENT') AND cv.member_id = 82

The result is as follows:
[image: query result]

This shows the three clients for the first class and the one client for the second. I just want two rows, one for each class. So, I add this:

 GROUP BY sch.id

Now, the result is as follows:
[image: query result]

So far, so good.

I know that there are two schedule ids for this member, so I modified the group by to pull those together into one:

 GROUP BY sch.id AND cv.member_id

I expected this to first merge the rows on sch.id (the result already shown in the picture above) and then on cv.member_id (we had two rows, so after merging there should be one).

The result is (I wrapped scheduleId in GROUP_CONCAT, so we can see that both schedule IDs are there):
[image: query result]

Now, just like I pulled together the two schedule IDs, I want to add up the time for the two scheduled classes.

I modify the query now:

SUM(TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at)) AS `totalTime`

And the result is:
[image: query result]

I got 32400! For some reason, the SUM is still seeing all 4 rows instead of just the unique 2.

I expected that the final result would be

+-----------+------------+
| member_id | total_time |
+-----------+------------+
| 82        | 16200      |
+-----------+------------+

None of the other columns are needed; I only added them to see what was happening.

What's wrong?

Best Answer

As Willem Renzema said, you've misunderstood how GROUP BY works. Since it doesn't seem like you've understood what he said, let me try saying it a little differently.

GROUP BY, logically enough, is used to group together rows from your result set. Normally you provide a list of the columns to use to group your rows together. GROUP BY sch.id, cv.member_id tells SQL to identify the unique sets of values for those two columns, and to group the rows in the result set by those values. In your case, there are two unique value pairs for those two columns:

  • cv.member_id = 82, sch.id = 17101
  • cv.member_id = 82, sch.id = 17153

So, you'll get two groups of rows: one of three rows that share the first pair of values, and one of a single row that has the second pair.
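
To see those two groups concretely, here is a small sketch that needs no tables. The derived table joined is a hypothetical stand-in for the four rows your join produces (three visits for event 17101, one for 17153), and the column names sch_id, member_id and rows_in_group are made up for the illustration.

```sql
-- Hypothetical stand-in for the four joined rows:
-- three visits to event 17101 and one to event 17153, all for member 82.
SELECT sch_id,
       member_id,
       COUNT(*) AS rows_in_group
  FROM (
        SELECT 17101 AS sch_id, 82 AS member_id
        UNION ALL SELECT 17101, 82
        UNION ALL SELECT 17101, 82
        UNION ALL SELECT 17153, 82
       ) AS joined
 GROUP BY sch_id, member_id;
```

This should return two rows, one per (sch_id, member_id) pair, with row counts of 3 and 1.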

Adding columns to a GROUP BY clause will never result in fewer groups - either the new column(s) are the same in all rows (in which case you have the same number of groups), or the new column(s) have different values for some rows in one or more of your original groups (in which case, you'll now have more groups).

Also (as pointed out by Willem), your GROUP BY syntax doesn't mean what you think it does. The columns in a GROUP BY list are separated by commas. MySQL accepts GROUP BY sch.id AND cv.member_id, but it groups by a single calculated value: sch.id AND cv.member_id, that is, the result of treating both sch.id and cv.member_id as Boolean values. Since neither is 0, both evaluate to 1 (true) when converted to Booleans, and the combination (true AND true) is true. So, you wind up with just one group, of 4 rows, and SUM adds up the time for all 4.
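
A quick way to convince yourself is to evaluate the expression on its own and then group by it. This is only a sketch: the derived table joined is a hypothetical stand-in for your four joined rows, with each event's 8100-second duration repeated on every row, and sch_id, member_id and total_time are illustrative names.

```sql
-- The grouping expression is a single Boolean value, not a list of columns.
-- Both operands are non-zero, so the result is true.
SELECT (17101 AND 82) AS grouping_value;

-- Grouping by that expression puts every row into the same group,
-- so SUM() adds the 8100-second duration once per joined row.
SELECT COUNT(*)        AS rows_in_group,
       SUM(total_time) AS total_time
  FROM (
        SELECT 17101 AS sch_id, 82 AS member_id, 8100 AS total_time
        UNION ALL SELECT 17101, 82, 8100
        UNION ALL SELECT 17101, 82, 8100
        UNION ALL SELECT 17153, 82, 8100
       ) AS joined
 GROUP BY sch_id AND member_id;
```

The first statement should return 1, and the second a single group of 4 rows totalling 32400, which is the figure you saw.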

Let's step back, and consider what (it looks like) you're actually trying to do. For a given member_id, you want the total time they're involved in activities of the types "Booked" or "Present".

Note that the total time is calculated from the schedule_event table. Also, note that a given member_id can be associated with the same schedule_event more than once. So, to get total time, we need to identify the distinct schedule_event rows that our member_id is tied to, and sum the time for those unique values.

That being the case, the simplest way to proceed is to use a sub-query to get the list of distinct schedule_events our member_id is tied to, and then sum the total times for those distinct events.

Here's a query that will do just that:

SELECT `member_id`
      ,SUM(`totalTime`) as `totalTime`
  FROM (
        SELECT DISTINCT
            cv.member_id AS `member_id`,
            sch.id AS `scheduleId`,
            TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at) AS `totalTime`
        FROM 
            `schedule_event` AS `sch`
            INNER JOIN `client_visit` AS `cv` ON cv.schedule_event_id = sch.id
            INNER JOIN `type_of_visit` AS `tov` ON tov.id = cv.status_type_id
        WHERE 
            (tov.type = 'TYPE_BOOKED' OR tov.type = 'TYPE_PRESENT') 
            AND cv.member_id = 82
       ) sq
 GROUP BY `member_id`;

The subquery (imaginatively labeled sq) is basically your original query. I changed your LEFT JOIN to an INNER JOIN, as we must have a client_visit record to identify both the member_id and the type of visit. I also removed the SUM on totalTime; at this point, we just want to know the time each schedule_event will take. Finally, I added DISTINCT - we don't care how many times this schedule_event appears with this member_id; the total time will be the same whether it shows up once, three times, or 207 times.

Once we've identified the schedule_event data that our member_id is connected to, then we want the total time for all those schedule_event rows. So, we take the results of the sub-query, group them by member_id (in case it would ever be necessary to pull this back for multiple member_id values), and sum up the calculated times for each schedule_event row.
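
If you'd like to try this without the fiddle, the following is a minimal sketch to run in a scratch database. The table definitions are cut down to the columns the query touches, and the visit ids, type ids and dates are invented; only the event ids (17101, 17153), member 82 and the 2h 15m duration come from your question.

```sql
-- Hypothetical, cut-down versions of the three tables.
CREATE TABLE type_of_visit  (id INT PRIMARY KEY, type VARCHAR(32));
CREATE TABLE schedule_event (id INT PRIMARY KEY, member_id INT,
                             starts_at DATETIME, ends_at DATETIME);
CREATE TABLE client_visit   (id INT PRIMARY KEY, member_id INT,
                             schedule_event_id INT, status_type_id INT);

INSERT INTO type_of_visit VALUES (1, 'TYPE_BOOKED'), (2, 'TYPE_PRESENT');

-- Two classes of 2h 15m (8100 seconds) each; the dates are made up.
INSERT INTO schedule_event VALUES
  (17101, 82, '2018-01-08 09:00:00', '2018-01-08 11:15:00'),
  (17153, 82, '2018-01-09 14:00:00', '2018-01-09 16:15:00');

-- Three visits to the first class, one to the second.
INSERT INTO client_visit VALUES
  (1, 82, 17101, 1),
  (2, 82, 17101, 1),
  (3, 82, 17101, 2),
  (4, 82, 17153, 1);

-- Same shape as the query above: de-duplicate per event first, then sum.
SELECT member_id,
       SUM(totalTime) AS totalTime
  FROM (
        SELECT DISTINCT
               cv.member_id,
               sch.id AS scheduleId,
               TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at) AS totalTime
          FROM schedule_event AS sch
               INNER JOIN client_visit  AS cv  ON cv.schedule_event_id = sch.id
               INNER JOIN type_of_visit AS tov ON tov.id = cv.status_type_id
         WHERE tov.type IN ('TYPE_BOOKED', 'TYPE_PRESENT')
           AND cv.member_id = 82
       ) AS sq
 GROUP BY member_id;
```

With this sample data the result should be one row: member_id 82 with totalTime 16200. Note that scheduleId has to stay in the DISTINCT list even though the outer query never reads it: both classes last 8100 seconds, so without it the two rows would be identical and DISTINCT would collapse them into one.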

Since joanolo had gone to the trouble of setting up a dbfiddle for your problem, I took his work and added this query at the end, so you could see that the results are what you wanted; the updated dbfiddle link is here.

I hope this helps clarify how GROUP BY actually works for you.