Suppose the following table t1
:
================= | tag | val | --+ for the sake of simplicity, val is non NULL ================= | a1 | v1 | | a1 | v2 | | a1 | v3 | | a1 | v4 | | a1 | v5 | | a2 | v6 | | a2 | v7 | | a2 | v8 | | a2 | v9 | | ... | ... | =================
If you execute the script below in MySQL:
SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`
You would get the average values grouped by the column tag
:
================= | tag | AVG() | ================= | a1 | avg1 | | a2 | avg2 | | a3 | avg3 | | a4 | avg4 | | ... | ... | =================
Besides AVG()
, MySQL has several other built in functions to calculate aggregate values (e.g. SUM()
, MAX()
, COUNT()
, and STD()
) that could be used in the same way as in the script aforementioned. However, there is no built in function for median.
This issue has already come up several other times at SE; however, most of them related to tables without GROUP BY
. The only one with GROUP BY
seems to be MySql: Count median grouped by day; however, the script seems to be overcomplicated.
Question
What would be an easy and simple way (if possible) to calculate this median?
Follow-up
Excellent article that complement the accepted answer:
http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/
Best Answer
This query could answer your question: median value and group by
I tried it on this dataset (mainly from here):
... (see explanation below)
and the result was:
Explanation
Inner subqueries will be computed first: sequence is (1)(2)(3)(4).
-- (4) compute the average (of 2 lines or 1 line)
-- (3) get lines to compute the median value
-- (2) sort dataset by tag and sequence
-- (1) sort dataset by tag and value
-- (2) continue here
-- (3) continue here
-- (4) continue here
Dataset:
Dataset after (3)
Outer query will compute the avg(val) group by tag value.
Hope this helps.
But what about median computing when there are null values? See EDIT2 below
Alternative: using a function
But the function will be called for each row:
This query will be "better":
That's all I can do! Hope it helps!
EDIT2 : about null values in data (column val in the example)
null values show be omited from the source data using a WHERE clause :
WHERE val IS NOT NULL
, in both 2 subqueriƩs that count lines and the subquery that gets data.EDIT3 (LAST EDIT) : change the initialisation of the @rownum position
It should by put at the deepest level : so that it declared the soonest in the excution of the query.
That is the same for the query.
Test with 2 data sets more :
39 rows in set (0.00 sec)