I have two tables in a MySQL 5.7.22 database: posts and reasons. Each post has and belongs to many reasons. Each reason has a weight associated with it, and each post therefore has a total aggregated weight associated with it.
For each increment of 10 points of weight (i.e. for 0, 10, 20, 30, etc), I want to get a count of posts that have a total weight less than or equal to that increment. I'd expect the results for that to look something like this:
weight | post_count
--------+------------
0 | 0
10 | 5
20 | 12
30 | 18
... | ...
280 | 20918
290 | 21102
... | ...
1250 | 118005
1260 | 118039
1270 | 118040
The total weights are approximately normally distributed, with a few very low values and a few very high values (the maximum is currently 1277), but the majority in the middle. There are just under 120,000 rows in posts, and around 120 in reasons. Each post has on average 5 or 6 reasons.
The relevant parts of the tables look like this:
CREATE TABLE `posts` (
id BIGINT PRIMARY KEY
);
CREATE TABLE `reasons` (
id BIGINT PRIMARY KEY,
weight INT(11) NOT NULL
);
CREATE TABLE `posts_reasons` (
    post_id BIGINT NOT NULL,
    reason_id BIGINT NOT NULL,
    CONSTRAINT fk_posts_reasons_posts FOREIGN KEY (post_id) REFERENCES posts(id),
    CONSTRAINT fk_posts_reasons_reasons FOREIGN KEY (reason_id) REFERENCES reasons(id)
);
So far, I've tried dropping the post ID and total weight into a view, then joining that view to itself to get an aggregated count:
CREATE VIEW `post_weights` AS (
SELECT
posts.id,
SUM(reasons.weight) AS reason_weight
FROM posts
INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
GROUP BY posts.id
);
SELECT
FLOOR(p1.reason_weight / 10) AS weight,
COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_weight / 10)
GROUP BY FLOOR(p1.reason_weight / 10)
ORDER BY FLOOR(p1.reason_weight / 10) ASC;
That is, however, unusably slow: I let it run for 15 minutes without it completing, which I can't do in production.
Is there a more efficient way to do this?
In case you're interested in testing the entire dataset, it's downloadable here. The file is around 60MB, it expands to around 250MB. Alternately, there are 12,000 rows in a GitHub gist here.
Best Answer
Using functions or expressions in JOIN conditions is usually a bad idea. I say usually because some optimisers can handle it fairly well and utilize indexes anyhow. I would suggest creating a table for the weights. Something like:
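The original code block was not preserved in this copy of the answer; a minimal sketch of such a table, assuming 10-point increments up to just past the question's current maximum of 1277 (table and column names are my own):

```sql
CREATE TABLE weights (
    weight INT NOT NULL PRIMARY KEY
);

-- MySQL 5.7 has no generate_series, so build 0, 10, ..., 1280
-- by cross-joining a small derived table of digits:
INSERT INTO weights (weight)
SELECT 10 * (d1.d + 10 * d2.d + 100 * d3.d)
FROM (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS d1
CROSS JOIN (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS d2
CROSS JOIN (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS d3
WHERE 10 * (d1.d + 10 * d2.d + 100 * d3.d) <= 1280;
```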
Make sure you have indexes on posts_reasons.
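The exact index definitions did not survive in this copy; as a sketch, composite indexes covering both join directions would let the aggregation be served from the indexes alone (the index names are my own):

```sql
-- Cover post_id -> reason_id lookups and the reverse direction.
ALTER TABLE posts_reasons
    ADD INDEX idx_post_reason (post_id, reason_id),
    ADD INDEX idx_reason_post (reason_id, post_id);
```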
A query like:
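The answer's query is missing here; a reconstruction under the assumptions above (a weights table of 10-point increments, plus the schema from the question) might look like:

```sql
-- For every 10-point increment, count the posts whose total
-- weight is at or below that increment (a cumulative count).
SELECT
    w.weight,
    COUNT(pw.post_id) AS post_count
FROM weights AS w
LEFT JOIN (
    SELECT pr.post_id, SUM(r.weight) AS total_weight
    FROM posts_reasons AS pr
    INNER JOIN reasons AS r ON r.id = pr.reason_id
    GROUP BY pr.post_id
) AS pw ON pw.total_weight <= w.weight
GROUP BY w.weight
ORDER BY w.weight;
```

The LEFT JOIN keeps increments with no qualifying posts in the result with a count of 0, matching the expected output in the question.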
My machine at home is probably 5 or 6 years old; it has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8 GB of RAM.
uname -a Linux dustbite 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I tested against:
https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5
If performance is critical and nothing else helps, you could create a summary table holding each post's total weight. You can maintain this table via triggers.
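Neither the summary table nor the triggers are preserved in this copy; a hedged sketch of the idea (all names are assumptions) could look like:

```sql
-- Summary table: one row per post with its aggregated weight.
CREATE TABLE post_weights_summary (
    post_id      BIGINT PRIMARY KEY,
    total_weight INT NOT NULL DEFAULT 0,
    INDEX idx_total_weight (total_weight),
    CONSTRAINT fk_summary_posts FOREIGN KEY (post_id) REFERENCES posts(id)
);

DELIMITER //

-- Keep the summary current when a reason is attached to a post.
CREATE TRIGGER trg_posts_reasons_ai
AFTER INSERT ON posts_reasons
FOR EACH ROW
BEGIN
    INSERT INTO post_weights_summary (post_id, total_weight)
    VALUES (NEW.post_id,
            (SELECT weight FROM reasons WHERE id = NEW.reason_id))
    ON DUPLICATE KEY UPDATE
        total_weight = total_weight + VALUES(total_weight);
END//

-- And when a reason is detached.
CREATE TRIGGER trg_posts_reasons_ad
AFTER DELETE ON posts_reasons
FOR EACH ROW
BEGIN
    UPDATE post_weights_summary
    SET total_weight = total_weight -
        (SELECT weight FROM reasons WHERE id = OLD.reason_id)
    WHERE post_id = OLD.post_id;
END//

DELIMITER ;
```

With the summary in place, the cumulative query only has to join weights against this one indexed table instead of re-aggregating posts_reasons every time.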
Since a certain amount of work needs to be done for each weight in weights, it may be beneficial to limit this table.
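As an illustration of such a restriction (the cutoff subquery is my own sketch), the weights table can be trimmed to just past the largest total weight actually present:

```sql
-- Delete increments above the current maximum total weight.
DELETE FROM weights
WHERE weight > 10 + (
    SELECT MAX(total_weight) FROM (
        SELECT SUM(r.weight) AS total_weight
        FROM posts_reasons AS pr
        INNER JOIN reasons AS r ON r.id = pr.reason_id
        GROUP BY pr.post_id
    ) AS totals
);
```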
Since I had a lot of unnecessary rows in my weights table (max 2590), the restriction above cut the execution time from 9 down to 4 seconds.