MySQL Median with GROUP BY – How to Calculate

aggregategroup byMySQL

Suppose the following table t1:

=================
|  tag  |  val  |       --+ for the sake of simplicity, val is non NULL
=================
|   a1  |  v1   |
|   a1  |  v2   |
|   a1  |  v3   |
|   a1  |  v4   |
|   a1  |  v5   |
|   a2  |  v6   |
|   a2  |  v7   |
|   a2  |  v8   |
|   a2  |  v9   |
|   ... | ...   |
=================

If you execute the script below in MySQL:

SELECT `tag`, AVG(`val`) FROM `t1` GROUP BY `tag`

You would get the average values grouped by the column tag:

=================
|  tag  | AVG() |
=================
|   a1  | avg1  |
|   a2  | avg2  |
|   a3  | avg3  |
|   a4  | avg4  |
|   ... |  ...  |
=================

Besides AVG(), MySQL has several other built in functions to calculate aggregate values (e.g. SUM(), MAX(), COUNT(), and STD()) that could be used in the same way as in the script aforementioned. However, there is no built in function for median.

This issue has already come up several other times at SE; however, most of them related to tables without GROUP BY. The only one with GROUP BY seems to be MySql: Count median grouped by day; however, the script seems to be overcomplicated.

Question

What would be an easy and simple way (if possible) to calculate this median?

Follow-up

Excellent article that complement the accepted answer:
http://danielsetzermann.com/howto/how-to-calculate-the-median-per-group-with-mysql/

Best Answer

This query could answer your question: median value and group by

            SELECT tag, AVG(val) as median
            FROM 
            (
              SELECT tag, val,
                  (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
                  seq,
                  (SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
                FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
                      FROM (SELECT * FROM median ORDER BY tag, val) t1 
                      ORDER BY tag, seq
                    ) t3 CROSS JOIN (SELECT @rownum := 0) x
                HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
                  or (ct%2 <> 0 and seq-delta = (ct+1)/2)
            ) T
            GROUP BY tag
            ORDER BY tag;

I tried it on this dataset (mainly from here):

            +------+------+
            | tag  | val  |
            +------+------+
            |    1 |    3 |
            |    1 |   13 |

... (see explanation below)

            |    3 |   12 |
            |    3 |   43 |
            |    3 |   15 |
            +------+------+

and the result was:

            +------+---------+
            | tag  | median  |
            +------+---------+
            |    1 | 23.0000 |
            |    2 | 22.0000 |
            |    3 | 15.0000 |
            +------+---------+

Explanation

Inner subqueries will be computed first: sequence is (1)(2)(3)(4).

-- (4) compute the average (of 2 lines or 1 line)

    SELECT tag, AVG(val) as median                          
      FROM 
      (

-- (3) get lines to compute the median value

        SELECT tag, val,                                       
           (SELECT count(*) FROM median t2                    -- +number of lines for the current tag value as ct
              WHERE t2.tag = t3.tag) as ct,
           seq,
           (SELECT count(*) FROM median t2                    -- +number of lines before the current tag value as delta
              WHERE t2.tag < t3.tag) as delta                --     to compute the starting line number of a tag
         FROM (

-- (2) sort dataset by tag and sequence

                SELECT tag, val,                            
                    @rownum := @rownum + 1 as seq       -- +@rownum enable to create a sequence from 0 by 1
              FROM (

-- (1) sort dataset by tag and value

                    SELECT * FROM median           
                    ORDER BY tag, val) t1 

-- (2) continue here

              ORDER BY tag, seq
            ) t3 CROSS JOIN (SELECT @rownum := 0) x            -- +use to set @rownum to 0 (no data)

-- (3) continue here

         HAVING (ct%2 = 0                                      -- +when ct is even, select the two lines around the middle
                  and seq-delta between floor((ct+1)/2) 
                                and floor((ct+1)/2) +1)
           or (ct%2 <> 0                                       -- +when ct is odd, select the one line in the middle
                  and seq-delta = (floor(ct+1)/2))
      ) T

-- (4) continue here

      GROUP BY tag
      ORDER BY tag;

Dataset:

        after (1)     after (2)           processing (3)   
    +------+------+                   
    | tag  | val  |  ct  delta  seq       seq-delta
    +------+------+                   
    |    1 |    3 |  15    0     1        1         ct : odd ct%2 <> 0  
    |    1 |    5 |  15    0     2        2         floor((ct+1)/2) : 8
    |    1 |    7 |  15    0     3        3         
    |    1 |   12 |  15    0     4        4         
    |    1 |   13 |  15    0     5        5
    |    1 |   14 |  15    0     6        6
    |    1 |   21 |  15    0     7        7
    |    1 |   23 |  15    0     8        8 ---> keep this line
    |    1 |   23 |  15    0     9        9 
    |    1 |   23 |  15    0     10       10
    |    1 |   23 |  15    0     11       11
    |    1 |   29 |  15    0     12       12
    |    1 |   39 |  15    0     13       13
    |    1 |   40 |  15    0     14       14
    |    1 |   56 |  15    0     15       15

    |    2 |    3 |  14    15    16        1         ct : even (ct%2 = 0  )
    |    2 |    5 |  14    15    17        2         floor((ct+1)/2) : 7
    |    2 |    7 |  14    15    18        3         floor((ct+1)/2)+1 : 8
    |    2 |   12 |  14    15    19        4
    |    2 |   13 |  14    15    20        5
    |    2 |   14 |  14    15    21        6
    |    2 |   21 |  14    15    22        7 ---> keep this line
    |    2 |   23 |  14    15    23        8 ---> keep this line
    |    2 |   23 |  14    15    24        9
    |    2 |   23 |  14    15    25        10
    |    2 |   23 |  14    15    26        11
    |    2 |   29 |  14    15    27        12
    |    2 |   40 |  14    15    28        13
    |    2 |   56 |  14    15    29        14

    |    3 |   12 |  3     29    30        1                  ct : odd ct%2 <> 0 
    |    3 |   15 |  3     29    31        2 ---> keep        floor((ct+1)/2) : 2
    |    3 |   43 |  3     29    32        3
    +------+------+

Dataset after (3)

    +------+------+------+------+-------+
    | tag  | val  | ct   | seq  | delta |
    +------+------+------+------+-------+
    |    1 |   23 |   15 |    8 |     0 |
    |    2 |   21 |   14 |   22 |    15 |
    |    2 |   23 |   14 |   23 |    15 |
    |    3 |   15 |    3 |   31 |    29 |
    +------+------+------+------+-------+

Outer query will compute the avg(val) group by tag value.

Hope this helps.

But what about median computing when there are null values? See EDIT2 below

Alternative: using a function

    DELIMITER //
    CREATE FUNCTION median(pTag int)
        RETURNS real
           READS SQL DATA
           DETERMINISTIC
           BEGIN
              DECLARE r real; -- result
    SELECT AVG(val) INTO r
    FROM 
    (
      SELECT val,
           (SELECT count(*) FROM median WHERE tag = pTag) as ct,
           seq
        FROM (SELECT val, @rownum := @rownum + 1 as seq
              FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1 
              ORDER BY seq
            ) t3 
            CROSS JOIN (SELECT @rownum := 0) x
        HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
          or (ct%2 <> 0 and seq = (ct+1)/2)
    ) T;
    return r;
    END//
    DELIMITER ;

But the function will be called for each row:

SELECT tag, median(tag) FROM median; -- my test table is 'median' too...

This query will be "better":

select tag, median(tag) 
  from (select distinct tag from median) t;

That's all I can do! Hope it helps!

EDIT2 : about null values in data (column val in the example)

null values show be omited from the source data using a WHERE clause : WHERE val IS NOT NULL, in both 2 subqueriƩs that count lines and the subquery that gets data.

EDIT3 (LAST EDIT) : change the initialisation of the @rownum position

It should by put at the deepest level : so that it declared the soonest in the excution of the query.

DELIMITER //
CREATE FUNCTION median(pTag int)
    RETURNS real
       READS SQL DATA
       DETERMINISTIC
       BEGIN
          DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM 
(
  SELECT val,
       (SELECT count(*) FROM median WHERE tag = pTag and val is not null)     as ct,
       seq
    FROM (SELECT val, @rownum := @rownum + 1 as seq
          FROM (SELECT * FROM median 
                       CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
             WHERE tag = pTag and val is not null ORDER BY val 
          ) t1 
          ORDER BY seq
       ) t3     
    HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
      or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;

That is the same for the query.

Test with 2 data sets more :

|    4 | NULL |
|    4 |   10 |
|    4 |   15 |
|    4 |   20 |
|    5 | NULL |
|    5 | NULL |
|    5 | NULL |
+------+------+

39 rows in set (0.00 sec)

+------+--------------+
| tag  | median2(tag) |
+------+--------------+
|    1 |           23 |
|    2 |           22 |
|    3 |           15 |
|    4 |           15 |
|    5 |         NULL |
+------+--------------+
5 rows in set (0.08 sec)