This query could answer your question: median value and group by
SELECT tag, AVG(val) as median
FROM
(
SELECT tag, val,
(SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
seq,
(SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
FROM (SELECT * FROM median ORDER BY tag, val) t1
ORDER BY tag, seq
) t3 CROSS JOIN (SELECT @rownum := 0) x
HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
or (ct%2 <> 0 and seq-delta = (ct+1)/2)
) T
GROUP BY tag
ORDER BY tag;
I tried it on this dataset (mainly from here):
+------+------+
| tag | val |
+------+------+
| 1 | 3 |
| 1 | 13 |
... (see explanation below)
| 3 | 12 |
| 3 | 43 |
| 3 | 15 |
+------+------+
and the result was:
+------+---------+
| tag | median |
+------+---------+
| 1 | 23.0000 |
| 2 | 22.0000 |
| 3 | 15.0000 |
+------+---------+
Explanation
Inner subqueries will be computed first: sequence is (1)(2)(3)(4).
-- (4) compute the average (of 2 lines or 1 line)
SELECT tag, AVG(val) as median
FROM
(
-- (3) get lines to compute the median value
SELECT tag, val,
(SELECT count(*) FROM median t2 -- +number of lines for the current tag value as ct
WHERE t2.tag = t3.tag) as ct,
seq,
(SELECT count(*) FROM median t2 -- +number of lines before the current tag value as delta
WHERE t2.tag < t3.tag) as delta -- to compute the starting line number of a tag
FROM (
-- (2) sort dataset by tag and sequence
SELECT tag, val,
@rownum := @rownum + 1 as seq -- +@rownum enable to create a sequence from 0 by 1
FROM (
-- (1) sort dataset by tag and value
SELECT * FROM median
ORDER BY tag, val) t1
-- (2) continue here
ORDER BY tag, seq
) t3 CROSS JOIN (SELECT @rownum := 0) x -- +use to set @rownum to 0 (no data)
-- (3) continue here
HAVING (ct%2 = 0 -- +when ct is even, select the two lines around the middle
and seq-delta between floor((ct+1)/2)
and floor((ct+1)/2) +1)
or (ct%2 <> 0 -- +when ct is odd, select the one line in the middle
and seq-delta = (floor(ct+1)/2))
) T
-- (4) continue here
GROUP BY tag
ORDER BY tag;
Dataset:
after (1) after (2) processing (3)
+------+------+
| tag | val | ct delta seq seq-delta
+------+------+
| 1 | 3 | 15 0 1 1 ct : odd ct%2 <> 0
| 1 | 5 | 15 0 2 2 floor((ct+1)/2) : 8
| 1 | 7 | 15 0 3 3
| 1 | 12 | 15 0 4 4
| 1 | 13 | 15 0 5 5
| 1 | 14 | 15 0 6 6
| 1 | 21 | 15 0 7 7
| 1 | 23 | 15 0 8 8 ---> keep this line
| 1 | 23 | 15 0 9 9
| 1 | 23 | 15 0 10 10
| 1 | 23 | 15 0 11 11
| 1 | 29 | 15 0 12 12
| 1 | 39 | 15 0 13 13
| 1 | 40 | 15 0 14 14
| 1 | 56 | 15 0 15 15
| 2 | 3 | 14 15 16 1 ct : even (ct%2 = 0 )
| 2 | 5 | 14 15 17 2 floor((ct+1)/2) : 7
| 2 | 7 | 14 15 18 3 floor((ct+1)/2)+1 : 8
| 2 | 12 | 14 15 19 4
| 2 | 13 | 14 15 20 5
| 2 | 14 | 14 15 21 6
| 2 | 21 | 14 15 22 7 ---> keep this line
| 2 | 23 | 14 15 23 8 ---> keep this line
| 2 | 23 | 14 15 24 9
| 2 | 23 | 14 15 25 10
| 2 | 23 | 14 15 26 11
| 2 | 29 | 14 15 27 12
| 2 | 40 | 14 15 28 13
| 2 | 56 | 14 15 29 14
| 3 | 12 | 3 29 30 1 ct : odd ct%2 <> 0
| 3 | 15 | 3 29 31 2 ---> keep floor((ct+1)/2) : 2
| 3 | 43 | 3 29 32 3
+------+------+
Dataset after (3)
+------+------+------+------+-------+
| tag | val | ct | seq | delta |
+------+------+------+------+-------+
| 1 | 23 | 15 | 8 | 0 |
| 2 | 21 | 14 | 22 | 15 |
| 2 | 23 | 14 | 23 | 15 |
| 3 | 15 | 3 | 31 | 29 |
+------+------+------+------+-------+
Outer query will compute the avg(val) group by tag value.
Hope this helps.
But what about median computing when there are null values? See EDIT2 below
Alternative: using a function
DELIMITER //
CREATE FUNCTION median(pTag int)
RETURNS real
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM
(
SELECT val,
(SELECT count(*) FROM median WHERE tag = pTag) as ct,
seq
FROM (SELECT val, @rownum := @rownum + 1 as seq
FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1
ORDER BY seq
) t3
CROSS JOIN (SELECT @rownum := 0) x
HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;
But the function will be called for each row:
SELECT tag, median(tag) FROM median; -- my test table is 'median' too...
This query will be "better":
select tag, median(tag)
from (select distinct tag from median) t;
That's all I can do! Hope it helps!
EDIT2 : about null values in data (column val in the example)
null values show be omited from the source data using a WHERE clause : WHERE val IS NOT NULL
, in both 2 subqueriƩs that count lines and the subquery that gets data.
EDIT3 (LAST EDIT) : change the initialisation of the @rownum position
It should by put at the deepest level : so that it declared the soonest in the excution of the query.
DELIMITER //
CREATE FUNCTION median(pTag int)
RETURNS real
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM
(
SELECT val,
(SELECT count(*) FROM median WHERE tag = pTag and val is not null) as ct,
seq
FROM (SELECT val, @rownum := @rownum + 1 as seq
FROM (SELECT * FROM median
CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
WHERE tag = pTag and val is not null ORDER BY val
) t1
ORDER BY seq
) t3
HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;
That is the same for the query.
Test with 2 data sets more :
| 4 | NULL |
| 4 | 10 |
| 4 | 15 |
| 4 | 20 |
| 5 | NULL |
| 5 | NULL |
| 5 | NULL |
+------+------+
39 rows in set (0.00 sec)
+------+--------------+
| tag | median2(tag) |
+------+--------------+
| 1 | 23 |
| 2 | 22 |
| 3 | 15 |
| 4 | 15 |
| 5 | NULL |
+------+--------------+
5 rows in set (0.08 sec)
This is a greatest-n-per-group
problem and there are many ways to solve it (CROSS APPLY
, window functions, subquery with GROUP BY
, etc). Here's a method using window functions and a CTE:
WITH ct AS
( SELECT *,
rn = RANK() OVER (PARTITION BY PARAMETER_NAME, GW_LOCATION_ID
ORDER BY Report_Result DESC)
FROM SLVs_Flagged
)
SELECT PARAMETER_NAME, GW_LOCATION_ID,
Max_Report_Result = Report_Result,
DETECT_FLAG
-- more columns
FROM ct
WHERE rn = 1
ORDER BY PARAMETER_NAME, GW_LOCATION_ID ;
The query will return all tied results (if there are ties). If you want a single result for every (PARAMETER_NAME, GW_LOCATION_ID)
combination, you can resolve ties by using ROW_NUMBER()
instead of RANK()
and modifying the ORDER BY
inside the OVER (..)
clause. Eg. (prefer DETECT_FLAG
with N
over Y
):
rn = ROW_NUMBER() OVER (PARTITION BY PARAMETER_NAME, GW_LOCATION_ID
ORDER BY Report_Result DESC, DETECT_FLAG)
Best Answer
If I get it right, you can use window functions:
In the inner select, the events are given an ordering number per event. In the outer select, the oldest one per event is chosen and the result is ordered by this date.
If this is not what you meant, you can change your query so that the order by becomes deterministic, by adding an aggregate function (I used MIN here):