Postgresql – GROUP BY after ORDER BY/LIMIT/OFFSEt

group bypostgresql

I have two tables events and eventdate, where eventdate has a row event that corresponds to the id column in events. (So an event can have multiple dates). `

Now I am trying to get the unique events of the first five eventdates after a certain date. I tried this query:

SELECT ed."event"
    FROM public."eventdate" as ed
    GROUP BY ed."event"
    ORDER BY ed."start_date" ASC
    LIMIT 5
    OFFSET 0

but the following error is thrown:

ERROR: column "ed.start_date" must appear in the GROUP BY clause or
be used in an aggregate function
LINE 4: ORDER BY ed."start_date" DESC

To be clear, I expect a list of ids, the list should have at maximum 5 entries, and no duplicates (a duplicate would just be "removed").

Given the following set:

id | event | start_date
1  | 1     | 0
2  | 2     | 0
3  | 1     | 2
4  | 4     | 3
5  | 3     | 4
6  | 1     | 5
7  | 5     | 6
7  | 6     | 6

The result would be:

event
1 (if I'd count the ids here it would be "2")
2
4
3

so basically the invariant would be: summing the count of each group = 5

Best Answer

If I get it right, you can use window functions:

select event
from (
    select ed.event, ed.start_date
         , row_number() over (partition by ed.event order by ed.start_date) as rn
    from public.eventdate
) as x
where rn = 1
order by start_date
limit 5

In the inner select, the events are given an ordering number per event. In the outer select, the oldest one per event is chosen and the result is ordered by this date.

If this is not what you meant, you can change your query so that the order by becomes deterministic, by adding an aggregate function (I used MIN here):

SELECT ed."event"
FROM public."eventdate" as ed
GROUP BY ed."event"
ORDER BY MIN(ed."start_date") ASC
LIMIT 5
OFFSET 0

Explanation

Inner subqueries will be computed first: sequence is (1)(2)(3)(4).

-- (4) compute the average (of 2 lines or 1 line)

    SELECT tag, AVG(val) as median                          
      FROM 
      (

-- (3) get lines to compute the median value

        SELECT tag, val,                                       
           (SELECT count(*) FROM median t2                    -- +number of lines for the current tag value as ct
              WHERE t2.tag = t3.tag) as ct,
           seq,
           (SELECT count(*) FROM median t2                    -- +number of lines before the current tag value as delta
              WHERE t2.tag < t3.tag) as delta                --     to compute the starting line number of a tag
         FROM (

-- (2) sort dataset by tag and sequence

                SELECT tag, val,                            
                    @rownum := @rownum + 1 as seq       -- +@rownum enable to create a sequence from 0 by 1
              FROM (

-- (1) sort dataset by tag and value

                    SELECT * FROM median           
                    ORDER BY tag, val) t1

-- (2) continue here

              ORDER BY tag, seq
            ) t3 CROSS JOIN (SELECT @rownum := 0) x            -- +use to set @rownum to 0 (no data)

-- (3) continue here

         HAVING (ct%2 = 0                                      -- +when ct is even, select the two lines around the middle
                  and seq-delta between floor((ct+1)/2) 
                                and floor((ct+1)/2) +1)
           or (ct%2 <> 0                                       -- +when ct is odd, select the one line in the middle
                  and seq-delta = (floor(ct+1)/2))
      ) T

-- (4) continue here

      GROUP BY tag
      ORDER BY tag;

Dataset:

        after (1)     after (2)           processing (3)   
    +------+------+                   
    | tag  | val  |  ct  delta  seq       seq-delta
    +------+------+                   
    |    1 |    3 |  15    0     1        1         ct : odd ct%2 <> 0  
    |    1 |    5 |  15    0     2        2         floor((ct+1)/2) : 8
    |    1 |    7 |  15    0     3        3         
    |    1 |   12 |  15    0     4        4         
    |    1 |   13 |  15    0     5        5
    |    1 |   14 |  15    0     6        6
    |    1 |   21 |  15    0     7        7
    |    1 |   23 |  15    0     8        8 ---> keep this line
    |    1 |   23 |  15    0     9        9 
    |    1 |   23 |  15    0     10       10
    |    1 |   23 |  15    0     11       11
    |    1 |   29 |  15    0     12       12
    |    1 |   39 |  15    0     13       13
    |    1 |   40 |  15    0     14       14
    |    1 |   56 |  15    0     15       15

    |    2 |    3 |  14    15    16        1         ct : even (ct%2 = 0  )
    |    2 |    5 |  14    15    17        2         floor((ct+1)/2) : 7
    |    2 |    7 |  14    15    18        3         floor((ct+1)/2)+1 : 8
    |    2 |   12 |  14    15    19        4
    |    2 |   13 |  14    15    20        5
    |    2 |   14 |  14    15    21        6
    |    2 |   21 |  14    15    22        7 ---> keep this line
    |    2 |   23 |  14    15    23        8 ---> keep this line
    |    2 |   23 |  14    15    24        9
    |    2 |   23 |  14    15    25        10
    |    2 |   23 |  14    15    26        11
    |    2 |   29 |  14    15    27        12
    |    2 |   40 |  14    15    28        13
    |    2 |   56 |  14    15    29        14

    |    3 |   12 |  3     29    30        1                  ct : odd ct%2 <> 0 
    |    3 |   15 |  3     29    31        2 ---> keep        floor((ct+1)/2) : 2
    |    3 |   43 |  3     29    32        3
    +------+------+

Dataset after (3)

    +------+------+------+------+-------+
    | tag  | val  | ct   | seq  | delta |
    +------+------+------+------+-------+
    |    1 |   23 |   15 |    8 |     0 |
    |    2 |   21 |   14 |   22 |    15 |
    |    2 |   23 |   14 |   23 |    15 |
    |    3 |   15 |    3 |   31 |    29 |
    +------+------+------+------+-------+

Outer query will compute the avg(val) group by tag value.

Hope this helps.

But what about median computing when there are null values? See EDIT2 below

Alternative: using a function

    DELIMITER //
    CREATE FUNCTION median(pTag int)
        RETURNS real
           READS SQL DATA
           DETERMINISTIC
           BEGIN
              DECLARE r real; -- result
    SELECT AVG(val) INTO r
    FROM 
    (
      SELECT val,
           (SELECT count(*) FROM median WHERE tag = pTag) as ct,
           seq
        FROM (SELECT val, @rownum := @rownum + 1 as seq
              FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1 
              ORDER BY seq
            ) t3 
            CROSS JOIN (SELECT @rownum := 0) x
        HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
          or (ct%2 <> 0 and seq = (ct+1)/2)
    ) T;
    return r;
    END//
    DELIMITER ;

But the function will be called for each row:

SELECT tag, median(tag) FROM median; -- my test table is 'median' too...

This query will be "better":

select tag, median(tag) 
  from (select distinct tag from median) t;

That's all I can do! Hope it helps!

EDIT2 : about null values in data (column val in the example)

null values show be omited from the source data using a WHERE clause : WHERE val IS NOT NULL, in both 2 subqueriés that count lines and the subquery that gets data.

EDIT3 (LAST EDIT) : change the initialisation of the @rownum position

It should by put at the deepest level : so that it declared the soonest in the excution of the query.

DELIMITER //
CREATE FUNCTION median(pTag int)
    RETURNS real
       READS SQL DATA
       DETERMINISTIC
       BEGIN
          DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM 
(
  SELECT val,
       (SELECT count(*) FROM median WHERE tag = pTag and val is not null)     as ct,
       seq
    FROM (SELECT val, @rownum := @rownum + 1 as seq
          FROM (SELECT * FROM median 
                       CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
             WHERE tag = pTag and val is not null ORDER BY val 
          ) t1 
          ORDER BY seq
       ) t3     
    HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
      or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;

That is the same for the query.

Test with 2 data sets more :

|    4 | NULL |
|    4 |   10 |
|    4 |   15 |
|    4 |   20 |
|    5 | NULL |
|    5 | NULL |
|    5 | NULL |
+------+------+

39 rows in set (0.00 sec)

+------+--------------+
| tag  | median2(tag) |
+------+--------------+
|    1 |           23 |
|    2 |           22 |
|    3 |           15 |
|    4 |           15 |
|    5 |         NULL |
+------+--------------+
5 rows in set (0.08 sec)

Sql-server – Group by multiple columns, agregate others and select all in SQL Server

This is a greatest-n-per-group problem and there are many ways to solve it (CROSS APPLY, window functions, subquery with GROUP BY, etc). Here's a method using window functions and a CTE:

WITH ct AS
  ( SELECT *,
           rn = RANK() OVER (PARTITION BY PARAMETER_NAME, GW_LOCATION_ID
                             ORDER BY Report_Result DESC)
    FROM SLVs_Flagged
  )
SELECT PARAMETER_NAME, GW_LOCATION_ID, 
       Max_Report_Result = Report_Result,
       DETECT_FLAG
       -- more columns
FROM ct
WHERE rn = 1
ORDER BY PARAMETER_NAME, GW_LOCATION_ID ;

The query will return all tied results (if there are ties). If you want a single result for every (PARAMETER_NAME, GW_LOCATION_ID) combination, you can resolve ties by using ROW_NUMBER() instead of RANK() and modifying the ORDER BY inside the OVER (..) clause. Eg. (prefer DETECT_FLAG with N over Y):

           rn = ROW_NUMBER() OVER (PARTITION BY PARAMETER_NAME, GW_LOCATION_ID
                                   ORDER BY Report_Result DESC, DETECT_FLAG)

Best Answer

Related Solutions

Mysql – Median with GROUP BY

Explanation

Alternative: using a function

EDIT2 : about null values in data (column val in the example)

EDIT3 (LAST EDIT) : change the initialisation of the @rownum position

Sql-server – Group by multiple columns, agregate others and select all in SQL Server

Related Question