DISTINCT ON()
Just as a side note, this is precisely what DISTINCT ON() does (not to be confused with DISTINCT):
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
So if you were to write,
SELECT myFirstAgg(z)
FROM foo
GROUP BY x,y;
It's effectively
SELECT DISTINCT ON (x,y) z
FROM foo;
-- ORDER BY x, y, z;
in that it takes the first z of each (x,y) group. There are two important differences:
You can also select other columns at no cost of further aggregation:
SELECT DISTINCT ON (x,y) z, k, r, t, v
FROM foo;
-- ORDER BY x, y, z, k, r, t, v;
Because there is no GROUP BY, you cannot use (real) aggregates with it:
CREATE TABLE foo AS
SELECT * FROM ( VALUES
(1,2,3),
(1,2,4),
(1,2,5)
) AS t(x,y,z);
-- Fails, as you should expect: z is neither grouped nor aggregated.
SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;
-- The GROUP BY form would not fail.
SELECT myFirstAgg(z), sum(z)
FROM foo
GROUP BY x,y;
Don't forget ORDER BY
Also, while I didn't emphasize it above, I will now:
Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
Always use an ORDER BY with DISTINCT ON.
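As a minimal sketch of why the ORDER BY matters, assuming the foo table created above with its three rows (selecting x and y alongside z is my addition, purely for readability):

```sql
-- Assumes foo(x, y, z) holds the rows (1,2,3), (1,2,4), (1,2,5).
-- The ORDER BY has to start with the DISTINCT ON expressions; the sort keys
-- after them decide which row of each (x, y) group counts as "first".
SELECT DISTINCT ON (x, y) x, y, z
FROM foo
ORDER BY x, y, z;        -- keeps the row with the smallest z: (1, 2, 3)

SELECT DISTINCT ON (x, y) x, y, z
FROM foo
ORDER BY x, y, z DESC;   -- keeps the row with the largest z: (1, 2, 5)
```

Without the ORDER BY, either row (or the middle one) could come back, depending on how the planner happens to read the table.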
Using an Ordered-Set Aggregate Function
I imagine a lot of people are looking for a first_value ordered-set aggregate function, so I just wanted to throw that out there. (first_value exists in PostgreSQL only as a window function, not as an aggregate.) If such an aggregate existed, it would look like this:
SELECT x, y, first_value() WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x,y;
Alas, it doesn't, but you can do this instead:
SELECT x, y, percentile_disc(0) WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x,y;
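A hedged sketch of the percentile_disc trick against the foo(x, y, z) table from earlier. The column aliases are mine. What it is meant to show is that, unlike DISTINCT ON, this form can sit next to real aggregates such as sum():

```sql
-- Assumes foo(x, y, z) holds the rows (1,2,3), (1,2,4), (1,2,5).
SELECT x, y,
       -- fraction 0 picks the first value in the given ordering
       percentile_disc(0) WITHIN GROUP (ORDER BY z)      AS first_z,  -- 3
       -- reversing the ordering picks the value from the other end
       percentile_disc(0) WITHIN GROUP (ORDER BY z DESC) AS last_z,   -- 5
       sum(z)                                            AS total_z   -- 12
FROM foo
GROUP BY x, y;
```

The trade-off is that percentile_disc returns only the value of its sort expression, so it cannot pull other columns from the same row the way DISTINCT ON can.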
This:
SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id;
finds all users who have a timesheet submitted on or before the given date. It's equivalent to:
SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');
or:
SELECT users.*
FROM users
WHERE EXISTS (
SELECT 1
FROM timesheets
WHERE timesheets.user_id = users.id
AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);
It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row, because there can only be one candidate users.name (or whatever) for any given users.id.
(PostgreSQL versions before 9.1 didn't know how to identify functional dependencies of the primary key and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here.)
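To make the functional-dependency rule concrete, here is a small sketch. The table definitions are hypothetical (the question's real schema is not shown); only the users.id primary key and the timesheets.user_id join column are taken from the queries above:

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE users (
    id   integer PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE timesheets (
    id           integer PRIMARY KEY,
    user_id      integer NOT NULL REFERENCES users (id),
    submitted_at timestamp NOT NULL
);

-- Accepted: users.name is functionally dependent on the grouped primary key,
-- so it needs neither an aggregate nor a place in the GROUP BY.
SELECT users.id, users.name, count(*) AS timesheet_count
FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
GROUP BY users.id;

-- Rejected: timesheets.submitted_at is not determined by users.id, so
-- PostgreSQL reports that it must appear in the GROUP BY clause or be used
-- in an aggregate function.
SELECT users.id, timesheets.submitted_at
FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
GROUP BY users.id;
```

The first query also shows what GROUP BY offers over the DISTINCT and EXISTS forms: a per-user aggregate can be computed in the same pass.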
Best Answer
Assuming a table like:
The following query appears to work in SQLite. As far as I can tell it is valid SQL99, since p1.weight and p1.nationality are functionally dependent on p1.player_id; most vendors still implement the more restrictive SQL92 rule for GROUP BY. Also note that I prefer ANSI joins over "," joins because I find them easier to read.
HAVING is just syntactic sugar for the chattier:
Another variant is to join against a derived table, pretty much the same query as your original:
On the other hand you can use window functions to achieve the same thing:
You can find an example at: db-fiddle
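The answer's own queries are not reproduced above, so the following is only a generic sketch of the window-function approach it mentions. The players table and the goal (the heaviest player or players per nationality) are my assumptions, built from the column names player_id, weight and nationality that appear in the text:

```sql
-- Hypothetical table; only the column names come from the answer.
CREATE TABLE players (
    player_id   integer PRIMARY KEY,
    nationality text    NOT NULL,
    weight      integer NOT NULL
);

-- Rank players inside each nationality by descending weight, then keep
-- the top-ranked rows. rank() keeps ties; row_number() would instead keep
-- exactly one row per nationality.
SELECT player_id, nationality, weight
FROM (
    SELECT p.*,
           rank() OVER (PARTITION BY nationality ORDER BY weight DESC) AS rnk
    FROM players AS p
) AS ranked
WHERE rnk = 1;
```

The filter has to live in an outer query because window functions are evaluated after WHERE and cannot be referenced there directly.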