Mysql – How to get top two tags

greatest-n-per-groupMySQL

Here are my tables's structures:

-- users
+----+--------+
| id |  name  |
+----+--------+
| 1  | Jack   |
| 2  | Peter  |
| 3  | Ali    |
+----+--------+

-- reputations
+----+-------------+---------+-------+------------+------------+
| id | post_id     | user_id | score | reputation | date_time  |
+----+-------------+---------+-------+------------+------------+ -- Suppose:
| 1  | 1           | 1       | 1     | 5          | 1500489844 | -- out of last week
| 2  | 4           | 3       | -1    | -2         | 1500499815 | -- out of last week
| 3  | 2           | 3       | 1     | 5          | 1500584821 |
| 4  | 3           | 1       | 1     | 5          | 1501389166 |
| 5  | 2           | 4       | 1     | 5          | 1501399142 |
| 6  | 2           | 1       | -1    | -2         | 1501399142 |
| 7  | 4           | 1       | 0     | 15         | 1501481186 |
+----+-------------+---------+-------+------------+------------+
-- Note: the last row came from an accepted-answer, that's why its score is 0

-- post_tag
+---------+--------+
| post_id | tag_id |
+---------+--------+
| 1       | 2      |
| 1       | 4      |
| 2       | 2      |
| 3       | 1      |
| 3       | 4      |
| 4       | 3      |
+---------+--------+

-- tags
+----+--------+
| id |  name  |
+----+--------+
| 1  | php    |
| 2  | html   |
| 3  | css    |
| 4  | mysql  |
+----+--------+

-- post_category
+---------+-------------+
| post_id | category_id |
+---------+-------------+
| 1       | 2           |
| 2       | 3           |
| 3       | 1           |
| 4       | 2           |
| 5       | 1           |
| 6       | 2           |
| 7       | 1           |
+---------+-------------+
-- there is an unique index on post_id column. since each post can has only one category

-- categories
+----+------------+
| id |    name    |
+----+------------+
| 1  | social     |
| 2  | technology |
| 3  | political  |
+----+------------+

Now I want to get list of users with their reputations, scores, top two tags in last week. How can I do that?

Note1: "top tow tags" means the two tags that the user has earned more repo in them.

Note2: Each question has at least 1 tag.


Here is the expected output:

+----+--------+-------+------------+----------+------------+
| id |  name  | score | reputation |   tags   |  category  |
+----+--------+-------+------------+----------+------------+
| 1  | Jack   | 0     | 18         | css,php  | technology |
| 3  | Ali    | 1     | 5          | html     | political  |
| 2  | Peter  | 0     | 0          | NULL     | NULL       |
+----+--------+-------+------------+----------+------------+
-- Note: It's ordered by reputation, score columns

Here is what I've tried:

SELECT u.*,
      sum(r.score) as score,
      sum(r.reputation) as reputation
      /* WS_CONCAT(',', t.name) as tags */
FROM users u
LEFT JOIN reputation r ON r.user_id = u.id
/* I need more joins to get tags, I don't know how exactly */
WHERE r.date_time > unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)
GROUP BY u.id
ORDER BY reputation, score

As you can see, my query lacks the last column (tags). Honestly, I don't know how I should manage such a thing. Do you know how?

Best Answer

That's a tricky job for MySQL (PostgreSQL and other DBs) can handle it much easier.

For the sake of clarity, I've changed your tag.name to tag.tag and user.name to user.user_name, so that there's less risk of confusing the names. Note also that your data contains a user 4. I've called him James.

Let's do it one step at a time.

Step 1

First, let's get your expected output minus the tags:

SELECT 
     u.id, 
     u.user_name,
     coalesce(sum(r.score), 0) as score,
     coalesce(sum(r.reputation), 0) as reputation
FROM 
    users u
    LEFT JOIN reputations r 
        ON r.user_id = u.id 
           AND r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
GROUP BY 
    u.id, u.user_name
ORDER BY 
    reputation DESC, score DESC ;
id | user_name | score | reputation
-: | :-------- | ----: | ---------:
 1 | Jack      |     0 |         18
 4 | James     |     1 |          5
 2 | Peter     |     0 |          0
 3 | Ali       |     0 |          0

NOTE: the condition r.date_time > 1500584821 must be part of your JOIN ON condition. If it is in the where clause, your LEFT JOIN becomes just an INNER JOIN, and you will lose users with no reputation.

Step 2

Now we make another query that returns, for every user_id, the list of tags and its corresponding reputation:

SELECT
    u.id, u.user_name, t.tag, sum(r.reputation) AS tag_reputation
FROM
    users u
    LEFT JOIN reputations r 
        ON r.user_id = u.id 
            AND r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
    JOIN post_tag pt ON pt.post_id = r.post_id
    JOIN tags t ON t.id = pt.tag_id
GROUP BY
    u.id, u.user_name, t.tag 
ORDER BY
    u.id, tag_reputation DESC;
id | user_name | tag   | tag_reputation
-: | :-------- | :---- | -------------:
 1 | Jack      | css   |             15
 1 | Jack      | php   |              5
 1 | Jack      | mysql |              5
 1 | Jack      | html  |             -2
 4 | James     | html  |              5

Step 3

We LEFT JOIN the two together (after a smallish simplification of the second, to ignore user names there), ON user_id and GROUP BY user_id, user_name, score, reputation

SELECT
    q1.user_id, q1.user_name, q1.score, q1.reputation, 
    substring_index(group_concat(q2.tag  ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags
FROM
    (SELECT 
        u.id AS user_Id, 
        u.user_name,
        coalesce(sum(r.score), 0) as score,
        coalesce(sum(r.reputation), 0) as reputation
    FROM 
        users u
        LEFT JOIN reputations r 
            ON    r.user_id = u.id 
              AND r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
    GROUP BY 
        u.id, u.user_name
    ) AS q1
    LEFT JOIN
    (
    SELECT
        r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
    FROM
        reputations r 
        JOIN post_tag pt ON pt.post_id = r.post_id
        JOIN tags t ON t.id = pt.tag_id
    WHERE
        r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
    GROUP BY
        user_id, t.tag
    ) AS q2
    ON q2.user_id = q1.user_id 
GROUP BY
    q1.user_id, q1.user_name, q1.score, q1.reputation
ORDER BY
    q1.reputation DESC, q1.score DESC ;
user_id | user_name | score | reputation | top_two_tags
------: | :-------- | ----: | ---------: | :-----------
      1 | Jack      |     0 |         18 | css,mysql   
      4 | James     |     1 |          5 | html        
      2 | Peter     |     0 |          0 | null        
      3 | Ali       |     0 |          0 | null        

Note that group_concat(q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ',') does return all tags associated with the user. We cut to two by means of substring_index(string,separator,number).

You can find all the settings, and queries at dbfiddle here


Side note (ties when ranking): for user Jack, the php and mysql tags have the same tag_reputation. I.e.: there's a tie between second and third. In that case, and without a second criterion for sorting the tags, both css,mysql and css,php are acceptable answers for the "top two" tags.