Mysql – Count consecutive null rows from a joined table

join;MySQL

I have the following tables:

members:
----------------------------------------------
| member_id | member_name | member_join_date |
----------------------------------------------  

events:
-------------------------
| event_id | event_name |
-------------------------

event_dates:
--------------------------------------
| event_id | event_date | event_week |
--------------------------------------

attendance:
------------------------------------------------------------
| member_id | event_id | attendance_date | attendance_week |
------------------------------------------------------------

So I have members who joined the organization at a certain time, and they may attend weekly events (whose occurrences are recorded in the event_dates table).

What I want to do is get all members that have at least 3 consecutive absences.

I did all I could and I think I have each part of the query but don't know how to merge them together.

Get all events that they could've attended:

    SELECT
      m.member_id AS id,
      ed.event_id,
      ed.event_week
    FROM
      members AS m
      LEFT JOIN
        event_dates AS ed
        ON
          ed.event_date >= m.member_join_date

Get all the events they actually attended:

    SELECT
      m.member_id AS id,
      a.event_id
    FROM
      members AS m
      LEFT JOIN
        attendance AS a
        ON
          a.member_id = m.member_id

And probably the answer to this question if I manage to merge the two queries into one:

https://stackoverflow.com/questions/1321670/count-of-consecutive-not-null-values

But this time I want to count NULL values

Best Answer

This is essentially a gaps-and-islands problem. When I have my SQL Server hat on, I often solve this kind of problem with two ROW_NUMBER() calls. Sadly, MySQL before version 8.0, unlike many other major SQL products, does not support ROW_NUMBER() or any other ranking function. To make up for that, however, you can use variable assignment in SELECTs, which MySQL does support (unlike many other major SQL products).

Below is a solution followed by an explanation:

SELECT
  member_id,
  member_name,
  event_id,
  COUNT(*) AS consecutive_times_missed,
  MIN(event_date) AS first_date_missed,
  MAX(event_date) AS last_date_missed
FROM (
  SELECT
    member_id,
    member_name,
    event_id,
    event_date,
    is_missed,
    @occ_ranking := (event_id = @last_event) * (member_id = @last_member) * @occ_ranking + 1,
    @att_ranking := (event_id = @last_event) * (member_id = @last_member)
                                             * (is_missed = @last_missed) * @att_ranking + 1,
    @occ_ranking - @att_ranking AS grp,
    @last_member := member_id,
    @last_event  := event_id,
    @last_missed := is_missed
  FROM (
    SELECT
      m.member_id,
      m.member_name,
      e.event_id,
      e.event_date,
      (a.attendance_date IS NULL) AS is_missed
    FROM       members     m
    INNER JOIN event_dates e ON m.member_join_date <= e.event_date
    LEFT  JOIN attendance  a ON m.member_id = a.member_id
                            AND e.event_id = a.event_id
                            AND e.event_date = a.attendance_date,
    (
      SELECT
        @occ_ranking := 0,
        @att_ranking := 0,
        @last_member := 0,
        @last_event  := 0,
        @last_missed := 0
    ) v
    ORDER BY
      m.member_id,
      e.event_id,
      e.event_date
  ) s
) s
WHERE
  is_missed = 1
GROUP BY
  member_id,
  member_name,
  event_id,
  grp
HAVING
  COUNT(*) >= 3
;

Basically, you start with joining members and event_dates to get all the event occurrences the members could have attended based on their membership dates. Then you throw in the attendance table (via a left join) to flag the missed occurrences. Here's an example of what you get by this time:

member_id  event_id  event_date  is_missed
---------  --------  ----------  ---------
1          1         2012-07-10  0
2          1         2012-07-10  1
1          1         2012-07-14  0
2          1         2012-07-14  1
…          …         …           …
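To see the flagging step in isolation, here is a minimal sketch that runs the innermost join on a few made-up rows. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL; the member names, dates and the `flagged` variable are assumptions for illustration, not data from the question.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members     (member_id INTEGER, member_name TEXT, member_join_date TEXT);
    CREATE TABLE event_dates (event_id INTEGER, event_date TEXT);
    CREATE TABLE attendance  (member_id INTEGER, event_id INTEGER, attendance_date TEXT);

    INSERT INTO members     VALUES (1, 'Ann', '2012-07-01'), (2, 'Bob', '2012-07-12');
    INSERT INTO event_dates VALUES (1, '2012-07-10'), (1, '2012-07-14');
    INSERT INTO attendance  VALUES (1, 1, '2012-07-10');
""")

# Every occurrence a member could have attended, flagged 1 when no
# matching attendance row exists (the LEFT JOIN produced NULLs).
flagged = conn.execute("""
    SELECT m.member_id, e.event_id, e.event_date,
           (a.attendance_date IS NULL) AS is_missed
    FROM       members     m
    INNER JOIN event_dates e ON m.member_join_date <= e.event_date
    LEFT  JOIN attendance  a ON m.member_id  = a.member_id
                            AND e.event_id   = a.event_id
                            AND e.event_date = a.attendance_date
    ORDER BY m.member_id, e.event_id, e.event_date
""").fetchall()

for row in flagged:
    print(row)
```

Member 2 joined after the first occurrence, so only the second occurrence appears for that member.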

At this point, the resulting set needs to be sorted by member_id, event_id and then event_date, because that order is crucial to the subsequent rankings calculation.

Two different rankings are produced for every row. The first (ranking1 in the table below, @occ_ranking in the query) is a ranking within the row's partition of (member_id, event_id); it is reset as soon as a new event or a new member is encountered. The second (ranking2, @att_ranking in the query) is a ranking within the specific group of consecutive event occurrences, either attended or missed, that the row belongs to; in addition to being reset on a new member or event, it is also reset whenever is_missed changes. And so you get something like this:

member_id  event_id  event_date  is_missed  ranking1  ranking2
---------  --------  ----------  ---------  --------  --------
…          …         …           …          …         …
1          1         2012-07-27  0          4         4
1          1         2012-07-28  0          5         5
1          1         2012-07-29  1          6         1
2          1         2012-07-10  1          1         1
2          1         2012-07-14  1          2         2
2          1         2012-07-25  1          3         3
2          1         2012-07-27  0          4         1
…          …         …           …          …         …
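The variable assignments can be hard to follow inside SQL, so here is a rough plain-Python sketch of the same bookkeeping. The function name `add_rankings` and the sample rows are assumptions for illustration; the rows must already be sorted by member, event and date.

```python
def add_rankings(rows):
    """Append (occ_ranking, att_ranking) to each pre-sorted row.

    rows: tuples of (member_id, event_id, event_date, is_missed),
    sorted by member_id, event_id, event_date.
    """
    out = []
    last_member = last_event = last_missed = None
    occ_ranking = att_ranking = 0
    for member_id, event_id, event_date, is_missed in rows:
        same_partition = (member_id == last_member and event_id == last_event)
        # Ranking within (member_id, event_id): restarts on a new member or event.
        occ_ranking = occ_ranking + 1 if same_partition else 1
        # Ranking within the current streak: also restarts when is_missed flips.
        att_ranking = att_ranking + 1 if same_partition and is_missed == last_missed else 1
        out.append((member_id, event_id, event_date, is_missed, occ_ranking, att_ranking))
        last_member, last_event, last_missed = member_id, event_id, is_missed
    return out


# Hypothetical rows, already sorted.
rows = [
    (1, 1, "2012-07-27", 0),
    (1, 1, "2012-07-28", 0),
    (1, 1, "2012-07-29", 1),
    (2, 1, "2012-07-10", 1),
    (2, 1, "2012-07-14", 1),
    (2, 1, "2012-07-25", 1),
    (2, 1, "2012-07-27", 0),
]
ranked = add_rankings(rows)
for r in ranked:
    print(r)
```

Subtracting the last column from the one before it gives a value that stays constant within each streak, which is what the query later groups on.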

As you may have noticed, the difference between the two ranking numbers is constant throughout the particular group of consecutive event occurrences of the same kind ("attended" or "missed") and is also unique for that group within its partition. Therefore, every such group can be identified by member_id, event_id and that difference (the grp column in the query).

And it now remains simply to filter the events leaving just the missed ones, group the rows and get the necessary aggregated data, like the number of rows and, possibly, as in the query above, the dates of the first and the last event occurrence in the group. The number of rows is also used in an additional filter condition to omit groups with fewer rows than required.
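On engines that do have ROW_NUMBER() (MySQL 8.0 and later, for instance), the same idea can be written with the classic two-ROW_NUMBER form instead of variables. The sketch below runs it through Python's sqlite3 module as a stand-in and assumes the bundled SQLite is 3.25 or newer. The `flagged` table is a hypothetical pre-computed version of the innermost subquery, and here the second ranking is partitioned by is_missed rather than reset at each change, which still yields a constant difference per streak.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-flagged rows (what the innermost join would produce).
conn.executescript("""
    CREATE TABLE flagged (member_id INTEGER, event_id INTEGER,
                          event_date TEXT, is_missed INTEGER);
    INSERT INTO flagged VALUES
        (1, 1, '2012-07-27', 0), (1, 1, '2012-07-28', 0), (1, 1, '2012-07-29', 1),
        (2, 1, '2012-07-10', 1), (2, 1, '2012-07-14', 1), (2, 1, '2012-07-25', 1),
        (2, 1, '2012-07-27', 0);
""")

# Difference of two row numbers is constant within each run of equal is_missed.
streaks = conn.execute("""
    SELECT member_id, event_id,
           COUNT(*)        AS consecutive_times_missed,
           MIN(event_date) AS first_date_missed,
           MAX(event_date) AS last_date_missed
    FROM (
        SELECT member_id, event_id, event_date, is_missed,
               ROW_NUMBER() OVER (PARTITION BY member_id, event_id
                                  ORDER BY event_date)
             - ROW_NUMBER() OVER (PARTITION BY member_id, event_id, is_missed
                                  ORDER BY event_date) AS grp
        FROM flagged
    ) AS s
    WHERE is_missed = 1
    GROUP BY member_id, event_id, grp
    HAVING COUNT(*) >= 3
""").fetchall()

print(streaks)
```

Only member 2 has a run of three missed occurrences in this sample, so that is the only row returned.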

Related Solutions

Mysql – Limit WHERE to MAX() & COUNT()

UPDATE 2012-02-21 17:06 EDT

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM
TAB_EVENT e LEFT JOIN
(
    SELECT * FROM
    (SELECT id,event_id,datetime FROM TAB_MESSAGE
    WHERE `read` IS NULL OR `read` = 0
    ORDER BY datetime DESC) mm
    LIMIT 10
) m
ON e.id=m.event_id
ORDER BY m.datetime DESC, e.id ASC;

@Sebastian, I put the query back in the original join order. Please try this as well !!!

UPDATE 2012-02-21 17:11 EDT

Make sure the datetime field is indexed:

ALTER TABLE TAB_MESSAGE ADD INDEX read_datetime_ndx (`read`,datetime);

Mysql – Query for data that is not there

Why are you determining a member's absence through another member's presence?

I guess there must be an events table in your schema. And I would expect it to have a column called something like event_date. It's the event_date column that you should use to determine which events a member could visit (members.registration_date < events.event_date).

After getting the list of members and their respective valid events, you can anti-join it to the attendance table to see which member missed which (or how many) events.

Here's an example query illustrating what I just said. It implements the anti-join using LEFT JOIN + WHERE IS NULL check:

SELECT
  m.member_id,
  e.event_id
FROM       members    m
INNER JOIN events     e ON m.registration_date < e.event_date
LEFT  JOIN attendance a ON m.member_id = a.member_id
                       AND e.event_id  = a.event_id
WHERE a.member_id IS NULL
  AND ... /* some other conditions to filter members and/or events, if necessary */
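As a quick check of the anti-join pattern, the sketch below runs it on a tiny made-up data set through Python's sqlite3 module (a stand-in for MySQL). The ids and dates are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: member 1 skipped event 20; member 2 registered after event 10.
conn.executescript("""
    CREATE TABLE members    (member_id INTEGER, registration_date TEXT);
    CREATE TABLE events     (event_id INTEGER, event_date TEXT);
    CREATE TABLE attendance (member_id INTEGER, event_id INTEGER);

    INSERT INTO members    VALUES (1, '2012-01-01'), (2, '2012-03-01');
    INSERT INTO events     VALUES (10, '2012-02-01'), (20, '2012-04-01');
    INSERT INTO attendance VALUES (1, 10), (2, 20);
""")

# LEFT JOIN + IS NULL keeps only (member, event) pairs with no attendance row.
missed = conn.execute("""
    SELECT m.member_id, e.event_id
    FROM       members    m
    INNER JOIN events     e ON m.registration_date < e.event_date
    LEFT  JOIN attendance a ON m.member_id = a.member_id
                           AND e.event_id  = a.event_id
    WHERE a.member_id IS NULL
    ORDER BY m.member_id, e.event_id
""").fetchall()

print(missed)
```

Event 10 is not reported for member 2 because it took place before that member registered.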

Or here's another example showing how you could count total events available to a member and those actually attended by him/her:

SELECT
  m.member_id,
  COUNT(e.event_id) AS total_events,
  COUNT(a.event_id) AS attended_events
FROM       members    m
INNER JOIN events     e ON m.registration_date < e.event_date
LEFT  JOIN attendance a ON m.member_id = a.member_id
                       AND e.event_id  = a.event_id
WHERE /* some conditions as necessary */
GROUP BY
  m.member_id

Note that in the last example there's no IS NULL check. It's because that query needs to get all the events per member. Missed events are just not counted by the COUNT() function (because the corresponding a.event_id value contains NULL in those cases).
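The NULL-skipping behaviour of COUNT() is easy to confirm on sample data. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL, with hypothetical ids and dates and a GROUP BY on member_id so that each member gets its own pair of counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: member 1 could attend two events and attended one.
conn.executescript("""
    CREATE TABLE members    (member_id INTEGER, registration_date TEXT);
    CREATE TABLE events     (event_id INTEGER, event_date TEXT);
    CREATE TABLE attendance (member_id INTEGER, event_id INTEGER);

    INSERT INTO members    VALUES (1, '2012-01-01'), (2, '2012-03-01');
    INSERT INTO events     VALUES (10, '2012-02-01'), (20, '2012-04-01');
    INSERT INTO attendance VALUES (1, 10), (2, 20);
""")

# COUNT(a.event_id) ignores rows where the LEFT JOIN found no attendance.
counts = conn.execute("""
    SELECT m.member_id,
           COUNT(e.event_id) AS total_events,
           COUNT(a.event_id) AS attended_events
    FROM       members    m
    INNER JOIN events     e ON m.registration_date < e.event_date
    LEFT  JOIN attendance a ON m.member_id = a.member_id
                           AND e.event_id  = a.event_id
    GROUP BY m.member_id
    ORDER BY m.member_id
""").fetchall()

print(counts)
```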

UPDATE

Since, as per your comment, the events are recurring (and thus have no fixed dates), I would suggest adding and using an event_calendar table, populated beforehand either manually or programmatically. Naturally, the events table in the above examples would be replaced with event_calendar.

Alternatively, if maintaining an event calendar is not an option, you could replace events in your queries with

(
  SELECT DISTINCT
    event_id,
    attendance_date AS event_date
    /* or, if they are not purely dates:
    CAST(attendance_date AS date) AS event_date
    */
  FROM attendance
) e

(basically, @ypercube's suggestion with dates thrown in).
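Here is a small sketch of what that derived calendar looks like, again run through Python's sqlite3 module with made-up rows. One consequence worth keeping in mind: an occurrence that nobody attended leaves no row in attendance, so it cannot appear in the derived calendar either.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical attendance rows: two members on the first date, one on the second.
conn.executescript("""
    CREATE TABLE attendance (member_id INTEGER, event_id INTEGER, attendance_date TEXT);
    INSERT INTO attendance VALUES
        (1, 10, '2012-02-01'),
        (2, 10, '2012-02-01'),
        (1, 10, '2012-02-08');
""")

# DISTINCT collapses the per-member rows into one row per event occurrence.
calendar = conn.execute("""
    SELECT DISTINCT event_id, attendance_date AS event_date
    FROM attendance
    ORDER BY event_id, event_date
""").fetchall()

print(calendar)
```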
