I'm having difficulty with GROUP BY in MySQL.
My database setup:
client_visit
- id
- member_id
- status_type_id (type_of_visit table)
- visit_starts_at
- visit_ends_at
member
- id
schedule_event
- id
- member_id
- starts_at
- ends_at
type_of_visit
- id
- type (TYPE_BOOKED, TYPE_PRESENT etc)
For the purpose of this question: a member teaches a class or leads an activity (a schedule_event) at a given time. A client signs up for this class or activity.
For example:
Clients A, B and C book visits, and those go into the client_visit table, which has schedule_event_id and member_id columns, so we know which class it is and which member is teaching it or leading the activity.
Now, we want to know the total time a given member spent teaching/leading events that clients signed up for (based on the client_visit type_of_visit column being equivalent to "Booked" or "Present"). We'll take member ID 82 as our test case.
Member ID 82 had 4 clients in two different classes, so if each class took 2h 15 minutes (8100 seconds), that means total time should be 16200 seconds.
Here's my query first:
SELECT cv.member_id AS `member_id`,
sch.id AS `scheduleId`,
cv.visit_starts_at AS `visitStartsAt`,
TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at) AS `totalTime`
FROM `schedule_event` AS `sch`
LEFT JOIN `client_visit` AS `cv` ON cv.schedule_event_id = sch.id
INNER JOIN `type_of_visit` AS `tov` ON tov.id = cv.status_type_id
WHERE (tov.type = 'TYPE_BOOKED' OR tov.type = 'TYPE_PRESENT') and cv.member_id = 82
This shows me the clients for the first class, and the one for the second. I just want two rows, one for each class. So, I add this:
GROUP BY sch.id
Now, the result is as follows:
So far so good. I know that there are two schedule IDs for this member, so I modified the GROUP BY to pull those together into one:
GROUP BY sch.id AND cv.member_id
I expect that it will first merge based on sch.id (the result is already shown in the picture above) and then on cv.member_id (we got two rows, so after merging there should be one). The result is (I modified scheduleId by adding GROUP_CONCAT, so we can see both schedule IDs are there):
Now, just like I pulled together the two schedule IDs, I want to add up the time for the two scheduled classes.
I modify the query now:
SUM(TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at)) AS `totalTime`
I got 32400! For some reason, the SUM is still seeing all 4 rows instead of just the unique 2.
I expected that the final result would be
+-----------+------------+
| member_id | total_time |
+-----------+------------+
| 82 | 16200 |
+-----------+------------+
None of the other columns are needed; I only added them to see what is happening.
What's wrong?
Best Answer
As Willem Renzema said, you've misunderstood how GROUP BY works. Since it doesn't seem like you've understood what he said, let me try saying it a little differently.

GROUP BY, logically enough, is used to group together rows from your result set. Normally you provide a list of the columns to use to group your rows together. GROUP BY sch.id, cv.member_id tells SQL to identify the unique sets of values for those two columns, and to group the rows in the result set by those values. In your case, there are two unique value pairs for those two columns:
- cv.member_id = 82, sch.id = 17101
- cv.member_id = 82, sch.id = 17153
So, you'll get two groups of rows - three that have the first pair of values, and one that has the second pair.
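
To see those groups for yourself, something like the following sketch may help. It reuses the joins from the question with the comma-separated GROUP BY list; the COUNT(*) column and its alias rowsInGroup are additions for illustration only and were not part of the original query. With the data described in the question, it would be expected to return one row per schedule_event, with counts of 3 and 1.

```sql
-- Illustration only: count how many joined rows fall into each group.
-- rowsInGroup is a made-up alias, not a column from the question.
SELECT cv.member_id AS member_id,
       sch.id       AS scheduleId,
       COUNT(*)     AS rowsInGroup
FROM schedule_event AS sch
INNER JOIN client_visit  AS cv  ON cv.schedule_event_id = sch.id
INNER JOIN type_of_visit AS tov ON tov.id = cv.status_type_id
WHERE tov.type IN ('TYPE_BOOKED', 'TYPE_PRESENT')
  AND cv.member_id = 82
GROUP BY sch.id, cv.member_id;  -- comma-separated: one group per distinct pair
```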
Adding additional columns to a GROUP BY clause will never result in fewer groups - either the new column(s) are the same in all rows (in which case you have the same number of groups), or the new column(s) have different values for some rows in one or more of your original groups (in which case, you'll now have more groups).

Also (as pointed out by Willem), you've got a syntax problem. The columns in a GROUP BY list are separated by commas. In your GROUP BY sch.id AND cv.member_id, you're grouping by a calculation: sch.id AND cv.member_id, or the result of treating both sch.id and cv.member_id as if they were Boolean values. Since neither is 0, when converted to Booleans both evaluate to 1 (true), and the combination (true AND true) is true. So, you wind up with just one group, of 4 rows. Summing 8100 seconds over those 4 rows is where your 32400 comes from.

Let's step back, and consider what (it looks like) you're actually trying to do. For a given member_id, you want the total time they're involved in activities of the types "Booked" or "Present".

Note that the total time is calculated out of the schedule_event table. Also, note that a given member_id can be associated with the same schedule_event more than once. So, to get total time, we need to identify the distinct schedule_event rows that our member_id is tied to, and sum the time for those unique values.

That being the case, the simplest way to proceed is to use a sub-query to get the list of distinct schedule_events our member_id is tied to, and then sum the total times for those distinct events. Here's a query that will do just that:
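
A minimal sketch of such a query, put together from the description that follows, so it is not necessarily the exact text that was posted. The subquery alias sq and the totalTime alias come from that description; the output alias total_time is taken from the expected result in the question. The visit_starts_at column is assumed to be left out of the subquery, since a per-client value there would stop DISTINCT from collapsing the duplicate rows.

```sql
-- Outer query: add up the per-event durations for each member.
SELECT sq.member_id,
       SUM(sq.totalTime) AS total_time
FROM (
    -- Subquery: one row per distinct (member, schedule_event) pair,
    -- carrying that event's duration in seconds. No SUM at this level.
    SELECT DISTINCT
           cv.member_id AS member_id,
           sch.id       AS scheduleId,
           TIMESTAMPDIFF(SECOND, sch.starts_at, sch.ends_at) AS totalTime
    FROM schedule_event AS sch
    INNER JOIN client_visit  AS cv  ON cv.schedule_event_id = sch.id
    INNER JOIN type_of_visit AS tov ON tov.id = cv.status_type_id
    WHERE tov.type IN ('TYPE_BOOKED', 'TYPE_PRESENT')
      AND cv.member_id = 82
) AS sq
GROUP BY sq.member_id;
```

Dropping the cv.member_id = 82 filter should give one total per member, which is why the outer GROUP BY is kept even for a single member.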
The subquery (imaginatively labeled sq) is basically your original query. I changed your LEFT JOIN to an INNER JOIN, as we must have a client_visit record to identify both the member_id and the type of visit. However, I removed the SUM on totalTime; at this point, we just want to know the time each schedule_event will take. I also added DISTINCT - we don't care how many times this schedule_event appears with this member_id; the total time will be the same whether it shows up once, three times, or 207 times.

Once we've identified the schedule_event data that our member_id is connected to, then we want the total time for all those schedule_event rows. So, we take the results of the sub-query, group them by member_id (in case it would ever be necessary to pull this back for multiple member_id values), and sum up the calculated times for each schedule_event row.

Since joanolo had gone to the trouble to set up a dbfiddle for your problem, I took his work and added this query at the end, so you could see the results were what you wanted; the updated dbfiddle link is here.

I hope this helps clarify how GROUP BY actually works for you.