MySQL – Get Incremental Counts of Aggregated Value in Joined Table

aggregateMySQLmysql-5.7

I have two tables in a MySQL 5.7.22 database: posts and reasons. Each post row has and belongs to many reason rows. Each reason has a weight associated with it, and each post therefore has a total aggregated weight associated with it.

For each increment of 10 points of weight (i.e. for 0, 10, 20, 30, etc), I want to get a count of posts that have a total weight less than or equal to that increment. I'd expect the results for that to look something like this:

 weight | post_count
--------+------------
      0 | 0
     10 | 5
     20 | 12
     30 | 18
    ... | ...
    280 | 20918
    290 | 21102
    ... | ...
   1250 | 118005
   1260 | 118039
   1270 | 118040

The total weights are approximately normally distributed, with a few very low values and a few very high values (maximum is currently 1277), but the majority in the middle. There are just under 120,000 rows in posts, and around 120 in reasons. Each post has on average 5 or 6 reasons.

The relevant parts of the tables look like this:

CREATE TABLE `posts` (
  id BIGINT PRIMARY KEY
);

CREATE TABLE `reasons` (
  id BIGINT PRIMARY KEY,
  weight INT(11) NOT NULL
);

CREATE TABLE `posts_reasons` (
  post_id BIGINT NOT NULL,
  reason_id BIGINT NOT NULL,
  CONSTRAINT fk_posts_reasons_posts (post_id) REFERENCES posts(id),
  CONSTRAINT fk_posts_reasons_reasons (reason_id) REFERENCES reasons(id)
);

So far, I've tried dropping the post ID and total weight into a view, then joining that view to itself to get an aggregated count:

CREATE VIEW `post_weights` AS (
    SELECT 
        posts.id,
        SUM(reasons.weight) AS reason_weight
    FROM posts
    INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
    INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
    GROUP BY posts.id
);

SELECT
    FLOOR(p1.reason_weight / 10) AS weight,
    COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_weight / 10)
GROUP BY FLOOR(p1.reason_weight / 10)
ORDER BY FLOOR(p1.reason_weight / 10) ASC;

That is, however, unusably slow – I let it run for 15 minutes without terminating, which I can't do in production.

Is there a more efficient way to do this?

In case you're interested in testing the entire dataset, it's downloadable here. The file is around 60MB, it expands to around 250MB. Alternately, there are 12,000 rows in a GitHub gist here.

Best Answer

Using functions or expressions in JOIN conditions is usually a bad idea, I say usually because some optimisers can handle it fairly well and utilize indexes anyhow. I would suggest creating a table for the weights. Something like:

CREATE TABLE weights
( weight int not null primary key 
);

INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);

Make sure you have indexes on posts_reasons:

CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);

A query like:

SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;

My machine at home is probably 5-6 years old, it has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8Gb of ram.

uname -a Linux dustbite 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I tested against:

https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5

MariaDB [test3]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)


SELECT w.weight
     , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
       FROM reasons r
       JOIN posts_reasons pr
             ON r.id = pr.reason_id
       GROUP BY pr.post_id
     ) as x
    ON w.weight > x.sum_weight
GROUP BY w.weight;

+--------+------------+
| weight | post_count |
+--------+------------+
|      0 |          1 |
|     10 |       2591 |
|     20 |       4264 |
|     30 |       4386 |
|     40 |       5415 |
|     50 |       7499 |
[...]   
|   1270 |     119283 |
|   1320 |     119286 |
|   1330 |     119286 |
[...]
|   2590 |     119286 |
+--------+------------+
256 rows in set (9.89 sec)

If performance is critical and nothing else helps you could create a summary table for:

SELECT pr.post_id, SUM(r.weight) as sum_weight     
FROM reasons r
JOIN posts_reasons pr
    ON r.id = pr.reason_id
GROUP BY pr.post_id

You can maintain this table via triggers

Since there is a certain amount of work that needs to be done for each weight in weights, it may be beneficial to limit this table.

    ON w.weight > x.sum_weight 
WHERE w.weight <= (select MAX(sum_weights) 
                   from (SELECT SUM(weight) as sum_weights 
                   FROM reasons r        
                   JOIN posts_reasons pr
                       ON r.id = pr.reason_id 
                   GROUP BY pr.post_id) a
                  ) 
GROUP BY w.weight

Since I had a lot of unnecesary rows in my weights table (max 2590), the restriction above cut the execution time from 9 down to 4 seconds.