MySQL count rows from other table AND get most recent title

MySQL

I have a situation which can be boiled down to the following:

TABLE students
COLUMNS id, name, email

TABLE attendances
COLUMNS id, student_id, event_id

TABLE events
COLUMNS id, name, date_of_event

I want to be able to see all the students, and how many events they've attended, so I have:

select name, email, count(attendances.id) as number_of_shows_attended from students
left join attendances
   on student_id = students.id
group by name, email

RESULT

Jane Doe | jane@uni.edu | 0
John Doe | john@uni.edu | 10
Will Doe | will@uni.edu | 2

Perfect. Except I also wish to have a final column which lists the most recent event attended (which can be based on the events.date_of_event date column).

I have tried simply adding the missing column, like so:

select name, email, count(attendances.id) as number_of_shows_attended, 
   event.name as most_recent_event_attended from students 
left join attendances
   on student_id = students.id
inner join events
   on attendances.event_id on events.id
group by name, email

It gives me an event name as expected, but how can I ensure it is the most recent one? I tried adding order by events.date_of_event desc, but of course it's sorting the results of the query.

Best Answer

You could do something like this (see fiddle here - it shows development of all stages of the query). There is a small possible flaw in your design - see discussion and (slightly changed) DDL and DML in an alternative schema available here (with slightly changed SQL also!).

In future, could you please provide us with some sample tables (DDL) data and data (DML) so that we have a Single Source of Truth for the question - it avoids errors and eliminates duplication of effort on behalf of those trying to help you!

I couldn't test my solution because I didn't have any data and I've stopped doing my own data mockups since when I do, the OP (frequently) comes back with something like "Oh, but, if I put in records with x, y and z...) your solution doens't work..." - that's then me then going #FF0000/rgb(255,0,0) and swearing at the computer (and the OP)...

I simulated data and tested - my previous solution fails - this one now works - it even takes account (through LEFT JOINs) of students who haven't attended any events!

Please test with real datasets with a large number of records and apply indexes to your tables' fields as appropriate. You can use MySQL 8.0.18 and above's new EXPLAIN ANALYZE (and see here) functionality to get a good idea of where indexes are required! If you run into problems, ask a new question here.

CREATE TABLE student
(
  s_id INT NOT NULL PRIMARY KEY,
  s_name VARCHAR (100) NOT NULL,
  s_email VARCHAR (255) NOT NULL
);

CREATE TABLE event
(
  e_id INT NOT NULL PRIMARY KEY,
  e_name VARCHAR (100) NOT NULL,
  e_date_scheduled DATE NOT NULL
);

CREATE TABLE attendance
(
  a_id INT NOT NULL PRIMARY KEY,
  a_sid INT NOT NULL,
  CONSTRAINT a_sid_fk FOREIGN KEY (a_sid) REFERENCES student(s_id),
  a_eid INT NOT NULL,
  CONSTRAINT a_eid_fk  FOREIGN KEY (a_eid) REFERENCES event(e_id)
);

Populate the tables:

INSERT INTO student VALUES
(1, 'name_1', 'email_1@domain_1.net'), 
(2, 'name_2', 'email_2@domain_2.net'), 
(3, 'name_3', 'email_3@domain_3.net'),
(4, 'name_4', 'email_3@domain_3.net'); --   <<<---- attends no events!

INSERT INTO event VALUES
(1, 'event_1', '2020-01-01'),
(2, 'event_2', '2020-02-01'),
(3, 'event_3', '2020-03-01'),
(4, 'event_4', '2020-04-01'),
(5, 'event_5', '2020-05-01'),
(6, 'event_6', '2020-06-01');

INSERT INTO attendance VALUES
(1, 1, 1),
(2, 1, 2),
(3, 1, 3),
(4, 1, 4),


(5, 2, 2),
(6, 2, 3),
(7, 2, 5),

(8, 3, 5),
(9, 3, 6);

The run the query:

SELECT s.s_id AS sid, s.s_name AS snam, s.s_email AS smail, 
       COUNT(a.a_id) AS a_cnt, t2.eds, COALESCE(t2.enam, '-----') AS "Latest event"
FROM
  student s
  LEFT JOIN attendance a
    ON s.s_id = a.a_sid
  JOIN
  (

    SELECT sid, eid, eds, enam
    FROM
    (
      SELECT s2.s_id AS sid, e.e_id AS eid, e.e_date_scheduled AS eds, e.e_name AS enam,
      ROW_NUMBER() OVER (PARTITION BY s2.s_id ORDER BY s2.s_id, e.e_id DESC) AS rn
      FROM 
        student s2
      LEFT JOIN attendance a2 
        ON s2.s_id = a2.a_sid
      LEFT JOIN event e
        ON a2.a_eid = e.e_id
    ORDER BY s2.s_id, e.e_date_scheduled DESC
  ) AS t
  WHERE t.rn = 1
) AS t2
ON s.s_id = t2.sid
GROUP BY s.s_id, s.s_name, s.s_email, t2.eds, t2.enam
ORDER BY s.s_id;

Result:

sid snam                   smail  a_cnt         eds  coalesce
1   name_1  email_1@domain_1.net      4  2020-04-01   event_4
2   name_2  email_2@domain_2.net      3  2020-05-01   event_5
3   name_3  email_3@domain_3.net      2  2020-06-01   event_6
4   name_4  email_3@domain_3.net      0        NULL     -----

Alternative schema:

The "flaw" to which I referred at the beginning is that you have an attendance table with an identity type PRIMARY KEY - it is not required - at least in this case.

Your use case (at least for this query) is a classic example of the JOINing table design pattern - aka a linking table (and many other names - there are 16 listed on the reference below). The formal name for this type of database object is an Associative Entity and the PK for such a table is normally just the two JOINing fields - no need for an identity column at all!

So, your attendance table definition should look more like this:

CREATE TABLE attendance
(
  -- a_id INT NOT NULL PRIMARY KEY,   -- << NOT necessary!
  a_sid INT NOT NULL,
  a_eid INT NOT NULL,
  CONSTRAINT a_s_e_pk PRIMARY KEY (a_sid, a_eid),
  CONSTRAINT a_sid_fk FOREIGN KEY (a_sid) REFERENCES student(s_id),
  CONSTRAINT a_eid_fk  FOREIGN KEY (a_eid) REFERENCES event(e_id)
);

The final query is now as below - the joins are now done exclusively using the JOINing fields and not the, no longer needed, surrogate PRIMARY KEY.

SELECT s.s_id AS sid, s.s_name AS snam, s.s_email AS smail, 
       COUNT(a.a_sid) AS a_cnt, t2.eds, COALESCE(t2.enam, '-----') AS "Latest event"
FROM
  student s
  LEFT JOIN attendance a
    ON s.s_id = a.a_sid
  JOIN
  (

    SELECT sid, eid, eds, enam
    FROM
    (
      SELECT s2.s_id AS sid, e.e_id AS eid, e.e_date_scheduled AS eds, e.e_name AS enam,
      ROW_NUMBER() OVER (PARTITION BY s2.s_id ORDER BY s2.s_id, e.e_id DESC) AS rn
      FROM 
        student s2
      LEFT JOIN attendance a2 
        ON s2.s_id = a2.a_sid
      LEFT JOIN event e
        ON a2.a_eid = e.e_id
    ORDER BY s2.s_id, e.e_date_scheduled DESC
  ) AS t
  WHERE t.rn = 1
) AS t2
ON s.s_id = t2.sid
GROUP BY s.s_id, s.s_name, s.s_email, t2.eds, t2.enam
ORDER BY s.s_id, t2.eds  -- << Can invert depending on requirements

Result identical!

Performance seems very similar - see the profiling at the end of the fiddle - however, the composite PK appears on average - tried 5 runs of each - to be slightly better but YMMV depending on your data distribution?