I have this setup
attendance table:
------------------------------------------
| member_id | attendance_date | event_id |
------------------------------------------
attendance indexes:
---------------------------------------------
| p1 (member_id, attendance_date, event_id) |
| p2 (attendance_date, member_id, event_id) |
| member (member_id) |
| total (attendance_date, event_id) |
---------------------------------------------
members table:
-------------------------------------------------
| member_id | department_id | registration_date |
-------------------------------------------------
members indexes:
-----------------------------------------------
| PK (member_id) |
| registration (registration_date, member_id) |
| department (department_id, member_id) |
-----------------------------------------------
So basically this is an attendance system, and I need to run reports based on the attendance table. I've figured out most of the queries I need, except for the part where I need to know how many absences a member has in a given period. I know how to query for data that is there, but I do not know how to query for data that is not there (absences).
I tried this:
    SELECT
        attendance_date,
        COUNT(DISTINCT event_id)
    FROM
        attendance
    WHERE
        WEEKOFYEAR(attendance_date) BETWEEN 27 AND 31
    GROUP BY
        WEEKOFYEAR(attendance_date)
This gives me the number of events a member should have been able to attend, based on other members' attendance records. The problem is that it breaks down when an event has no attendees at all (unlikely) or when there is a newly registered member.
I tried this:
    EXPLAIN SELECT
        m.member_id AS id,
        COUNT(DISTINCT a.event_id)
    FROM
        members AS m
    LEFT JOIN
        attendance AS a
    ON
        m.registration_date < a.attendance_date AND
        WEEKOFYEAR(a.attendance_date) BETWEEN 27 AND 31
    WHERE
        m.department_id = 1
But `EXPLAIN` tells me that it will scan 4M rows, so I know that it is not the way to go. I know that the query is wrong anyway, since `COUNT(DISTINCT)` will give me multiple rows per member. Any thoughts? Any advice on how I should index the database for this query is also appreciated.
EDIT
To be more specific, this is what I want to do with the query:
- Get all members of a department
- Get the number of events that have a record in the attendance table within a given period, counting only those events that occurred after the current member's registration date
Subtracting the member's own total number of attendance records from that number then gives how many absences the member already has.
I know that part of this problem stems from a fundamental error in the structure of the application (events shouldn't be defined based on attendance records), but these events are recurring and quite irregular in their occurrence patterns. So, to minimize the need for maintenance, I decided to derive event information (or at least the event count) from the attendance records, since it is very unlikely that an event will have 0 attendees.
Best Answer
Why are you determining a member's absence through another member's presence?
I guess there must be an `events` table in your schema, and I would expect it to have a column called something like `event_date`. It's the `event_date` column that you should use to determine which events a member could visit (`members.registration_date < events.event_date`).
After getting the list of members and their respective valid events, you can anti-join it to the `attendance` table to see which member missed which (or how many) events.
Here's an example query illustrating what I just said. It implements the anti-join using a `LEFT JOIN` + `WHERE IS NULL` check:
Or here's another example showing how you could count total events available to a member and those actually attended by him/her:
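Neither example query is reproduced in this copy of the answer. As a minimal sketch of both, they might look roughly like the queries below. The sketch assumes a hypothetical `events` table with one row per occurrence (`event_id`, `event_date`), assumes that `attendance_date` and `event_date` are plain `DATE` values, and uses placeholder dates for the reporting period:

```sql
-- Assumption: events(event_id, event_date) is a hypothetical table,
-- one row per occurrence; it is not part of the schema in the question.

-- 1) Anti-join: list every occurrence a member of department 1 missed.
SELECT
    m.member_id,
    e.event_id,
    e.event_date
FROM members AS m
INNER JOIN events AS e
    ON e.event_date > m.registration_date        -- only events after registration
LEFT JOIN attendance AS a
    ON  a.member_id       = m.member_id
    AND a.event_id        = e.event_id
    AND a.attendance_date = e.event_date
WHERE m.department_id = 1
  AND e.event_date BETWEEN '2012-07-02' AND '2012-08-05'  -- placeholder period
  AND a.member_id IS NULL;                       -- no matching attendance row

-- 2) Counting: events available to each member vs. events attended.
SELECT
    m.member_id,
    COUNT(e.event_id)                     AS events_available,
    COUNT(a.event_id)                     AS events_attended,  -- NULLs are skipped
    COUNT(e.event_id) - COUNT(a.event_id) AS absences
FROM members AS m
INNER JOIN events AS e
    ON e.event_date > m.registration_date
LEFT JOIN attendance AS a
    ON  a.member_id       = m.member_id
    AND a.event_id        = e.event_id
    AND a.attendance_date = e.event_date
WHERE m.department_id = 1
  AND e.event_date BETWEEN '2012-07-02' AND '2012-08-05'  -- placeholder period
GROUP BY m.member_id;
```

Both queries assume `attendance` holds at most one row per member, date, and event; if duplicates were possible, the counts would be inflated and `COUNT(DISTINCT ...)` over the occurrence would be needed instead. Because of the inner join to `events`, members with no eligible events in the period do not appear in the result of the second query.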
Note that in the last example there's no `IS NULL` check. It's because that query needs to get all the events per member. Missed events are just not counted by the `COUNT()` function (because the corresponding `a.event_id` value contains NULL in those cases).
UPDATE
Since, as per your comment, the events are recurring (and thus have no fixed dates), I would suggest adding and using an `event_calendar` table, populated beforehand either manually or programmatically. Naturally, the `events` table in the above examples would be replaced with `event_calendar`.
Alternatively, if maintaining an event calendar is not an option, you could replace `events` in your queries with a subquery (basically, @ypercube's suggestion with dates thrown in).
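A sketch of what such a replacement could look like, assuming every distinct (`event_id`, `attendance_date`) pair in `attendance` stands for one occurrence of an event. The date range is a placeholder, and this is an illustration rather than the answerer's original query:

```sql
-- Assumption: an occurrence exists only if at least one member attended it,
-- so occurrences with zero attendees are invisible to this query.
SELECT
    m.member_id,
    COUNT(e.event_id)                     AS events_available,
    COUNT(a.event_id)                     AS events_attended,
    COUNT(e.event_id) - COUNT(a.event_id) AS absences
FROM members AS m
INNER JOIN (
    -- Derived "events" list: one row per occurrence seen in attendance.
    SELECT DISTINCT
        event_id,
        attendance_date AS event_date
    FROM attendance
    WHERE attendance_date BETWEEN '2012-07-02' AND '2012-08-05'  -- placeholder period
) AS e
    ON e.event_date > m.registration_date        -- only events after registration
LEFT JOIN attendance AS a
    ON  a.member_id       = m.member_id
    AND a.event_id        = e.event_id
    AND a.attendance_date = e.event_date
WHERE m.department_id = 1
GROUP BY m.member_id;
```

Filtering on a plain date range rather than on `WEEKOFYEAR(attendance_date)` keeps the condition usable for an index range scan, so an index led by `attendance_date`, such as the existing `total (attendance_date, event_id)`, is one the optimizer can consider for the derived table.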