Usage of aggregate function when self-joining a table

aggregate functions join self-join sqlite

I want to find the players in a database whose weight is greater than the average weight:

SELECT *
FROM Player p1, Player p2
WHERE p1.weight > avg(p2.weight)

But I get an error:

Result: misuse of aggregate function avg()

I know that I can write:

SELECT *
FROM Player
WHERE weight > (SELECT AVG(weight)
                FROM Player)

But how can I achieve the same using a self-join and without a subquery?

Best Answer

Assuming a table like:

CREATE TABLE player
    ( player_id int not null primary key
    , weight int not null
    , nationality char(2) not null -- just any additional attribute
    );

insert into player (player_id, weight, nationality)
values (1,70,'SE'), (2,75,'RU'), (3,60,'US');

The following query appears to work in SQLite. As far as I can tell this is valid SQL99, since p1.weight and p1.nationality are functionally dependent on p1.player_id; most vendors still implement the more restrictive SQL92 rule for GROUP BY. Also note that I prefer ANSI joins over "," joins, as I find them easier to read.

select p1.player_id, p1.weight, p1.nationality
from player p1
cross join player p2
group by p1.player_id
having p1.weight > avg(p2.weight);
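One way to check this is to run both forms against an in-memory SQLite database. The following is a minimal sketch in Python using the standard sqlite3 module and the sample rows above; the helper name make_player_db is made up for illustration and is not part of the answer.

```python
import sqlite3


def make_player_db():
    """Build an in-memory database holding the three sample players."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table player ("
        " player_id int not null primary key,"
        " weight int not null,"
        " nationality char(2) not null)"
    )
    conn.executemany(
        "insert into player (player_id, weight, nationality) values (?, ?, ?)",
        [(1, 70, "SE"), (2, 75, "RU"), (3, 60, "US")],
    )
    return conn


conn = make_player_db()

# The aggregate inside WHERE is rejected, as in the question.
try:
    conn.execute(
        "select * from player p1, player p2 where p1.weight > avg(p2.weight)"
    ).fetchall()
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# The same comparison in HAVING is accepted: every p1 row is grouped with
# all p2 rows, so avg(p2.weight) is the average over the whole table.
having_rows = conn.execute(
    "select p1.player_id, p1.weight, p1.nationality"
    " from player p1 cross join player p2"
    " group by p1.player_id"
    " having p1.weight > avg(p2.weight)"
    " order by p1.player_id"
).fetchall()

print(where_error)
print(having_rows)
```

With the three sample rows the average is 205/3, roughly 68.3, so only players 1 and 2 should come back.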

having is just syntactic sugar for the chattier:

select player_id, weight, nationality
from (
    select p1.*, avg(p2.weight) as avg_weight
    from player p1
    cross join player p2
    group by p1.player_id
) as t 
where weight > avg_weight;

Another variant is to join against a derived table, pretty much the same query as your original:

select p1.*
from player p1
cross join (
  select avg(weight) as avg_weight
  from player
) as p2 
where p1.weight > p2.avg_weight;
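To confirm that the derived-table join returns the same rows as the scalar-subquery query from the question, both can be run side by side. This is a sketch under the same assumptions as the sample table above (Python's sqlite3 module, an in-memory database); the variable names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table player (player_id int primary key, weight int, nationality char(2))"
)
conn.executemany(
    "insert into player values (?, ?, ?)",
    [(1, 70, "SE"), (2, 75, "RU"), (3, 60, "US")],
)

# The query from the question: average computed by a scalar subquery.
scalar_subquery = """
    select player_id, weight, nationality
    from player
    where weight > (select avg(weight) from player)
    order by player_id
"""

# The variant above: the one-row derived table is cross joined to every player.
derived_table = """
    select p1.player_id, p1.weight, p1.nationality
    from player p1
    cross join (select avg(weight) as avg_weight from player) as p2
    where p1.weight > p2.avg_weight
    order by p1.player_id
"""

rows_scalar = conn.execute(scalar_subquery).fetchall()
rows_derived = conn.execute(derived_table).fetchall()
print(rows_scalar == rows_derived)
```

Because the derived table has exactly one row, the cross join does not multiply the players; it only attaches the average to each of them.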

On the other hand you can use window functions to achieve the same thing:

select player_id, weight, nationality
from (
    select p.*
         , avg(weight) OVER () as avg_weight 
    from player p
) as t
where weight > avg_weight;
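It may help to look at what the inner query produces before the outer filter is applied. The sketch below assumes Python's sqlite3 module and skips the query when the linked SQLite library is older than 3.25, the release that introduced window functions; the names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table player (player_id int primary key, weight int, nationality char(2))"
)
conn.executemany(
    "insert into player values (?, ?, ?)",
    [(1, 70, "SE"), (2, 75, "RU"), (3, 60, "US")],
)

# Window functions need SQLite 3.25 or newer.
window_supported = sqlite3.sqlite_version_info >= (3, 25, 0)

window_rows = None
heavier_ids = None
if window_supported:
    # avg(weight) over () repeats the table-wide average on every row,
    # without collapsing the rows the way GROUP BY would.
    window_rows = conn.execute(
        """
        select player_id, weight, avg_weight
        from (
            select p.*, avg(weight) over () as avg_weight
            from player p
        ) as t
        order by player_id
        """
    ).fetchall()
    # The outer WHERE of the query above is this row-by-row comparison.
    heavier_ids = [pid for pid, weight, avg_weight in window_rows if weight > avg_weight]

print(window_rows)
print(heavier_ids)
```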

You can find an example at: db-fiddle

Related Solutions

PostgreSQL – Is There a Type-Safe first() Aggregate Function?

`DISTINCT ON()`

Just as a side note, this is precisely what DISTINCT ON() does (not to be confused with DISTINCT):

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

So if you were to write,

SELECT myFirstAgg(z)
FROM foo
GROUP BY x,y;

It's effectively

SELECT DISTINCT ON(x,y) z
FROM foo;
-- ORDER BY z;

In that it takes the first z. There are two important differences:

1. You can also select other columns at no cost of further aggregation:

SELECT DISTINCT ON(x,y) z, k, r, t, v
FROM foo;
-- ORDER BY z, k, r, t, v;

2. Because there is no GROUP BY, you cannot use (real) aggregates with it:

CREATE TABLE foo AS
SELECT * FROM ( VALUES
  (1,2,3),
  (1,2,4),
  (1,2,5)
) AS t(x,y,z);

-- fails, as you should expect.
SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;

-- would not otherwise fail.
SELECT myFirstAgg(z), sum(z)
FROM foo
GROUP BY x,y;

Don't forget `ORDER BY`

Also, while I didn't bold it then, I will now:

Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

Always use an ORDER BY with DISTINCT ON.
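The "first row of each set" rule can be modelled in a few lines of plain Python. This is only a rough sketch of the semantics quoted above, not how PostgreSQL implements DISTINCT ON; the function name distinct_on and its parameters are invented for the illustration.

```python
def distinct_on(rows, key, order_by=None):
    """Keep the first row of each group of rows that share the same key.

    A rough model of DISTINCT ON semantics, not PostgreSQL's implementation.
    """
    if order_by is not None:
        rows = sorted(rows, key=order_by)
    first_seen = {}
    for row in rows:
        # setdefault stores a row only the first time its key shows up.
        first_seen.setdefault(key(row), row)
    return list(first_seen.values())


# Columns are (x, y, z); the rows are deliberately not sorted by z.
foo = [(1, 2, 5), (1, 2, 3), (1, 2, 4)]

# Without an ordering, the survivor is whichever row happens to come first.
unordered = distinct_on(foo, key=lambda r: (r[0], r[1]))

# Ordering by (x, y, z) makes the row with the smallest z the survivor.
ordered = distinct_on(foo, key=lambda r: (r[0], r[1]), order_by=lambda r: r)

print(unordered)
print(ordered)
```

In the model the unordered result depends on the input order of the list; in a real database that order is not something you control, which is why the ORDER BY matters.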

Using an Ordered-Set Aggregate Function

I imagine a lot of people are looking for a first_value ordered-set aggregate function. Just wanted to throw that out there. It would look like this, if the function existed:

SELECT x, y, first_value() WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x,y;

But alas, it does not. You can do this instead:

SELECT x, y, percentile_disc(0) WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x,y;

MySQL – group by clause without aggregate function

This:

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

Finds all users who have a timesheet submitted on or before the given date. It's equivalent to:

SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');

or:

SELECT  users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets 
    WHERE timesheets.user_id = users.id
    AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);

It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row because there can only be one candidate users.name or whatever for any given users.id row.

(Older PostgreSQL versions didn't know how to identify functional dependencies of the primary key and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here.)
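The equivalence of the three forms can be checked on a small data set. The sketch below uses Python's sqlite3 module rather than MySQL or PostgreSQL (SQLite also accepts the GROUP BY form), and the table layout and sample rows are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table users (id integer primary key, name text);
    create table timesheets (
        id integer primary key,
        user_id integer references users(id),
        submitted_at text
    );
    insert into users values (1, 'ann'), (2, 'bob'), (3, 'cy');
    insert into timesheets (user_id, submitted_at) values
        (1, '2010-07-01 09:00:00'),
        (1, '2010-07-05 09:00:00'),
        (2, '2010-07-07 09:00:00');
    """
)
cutoff = "2010-07-06 15:27:05.117700"
join_sql = (
    " from users inner join timesheets on timesheets.user_id = users.id"
    " where timesheets.submitted_at <= ?"
)

# A bare join repeats a user once per matching timesheet.
plain_rows = conn.execute("select users.*" + join_sql, (cutoff,)).fetchall()

# GROUP BY on the primary key and DISTINCT both collapse the repeats.
group_by_rows = conn.execute(
    "select users.*" + join_sql + " group by users.id order by users.id", (cutoff,)
).fetchall()
distinct_rows = conn.execute(
    "select distinct users.*" + join_sql + " order by users.id", (cutoff,)
).fetchall()

# EXISTS never produces the repeats in the first place.
exists_rows = conn.execute(
    "select users.* from users where exists ("
    " select 1 from timesheets"
    " where timesheets.user_id = users.id and timesheets.submitted_at <= ?)"
    " order by users.id",
    (cutoff,),
).fetchall()

print(plain_rows)
print(group_by_rows, distinct_rows, exists_rows)
```

Only the first user has timesheets on or before the cutoff, and has two of them, so the bare join returns that user twice while the other three queries return the user once.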
