This:
SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id
Finds all users who have a timesheet submitted on or before the given date. It's equivalent to:
SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');
or:
SELECT users.*
FROM users
WHERE EXISTS (
SELECT 1
FROM timesheets
WHERE timesheets.user_id = users.id
AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);
It works because users.id
is the primary key, so all other fields of users
are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row because there can only be one candidate users.name
or whatever for any given users.id
row.
(Older PostgreSQL versions didn't know how to identify functional dependencies of the primary key and and would throw an ERROR
about needing to use an aggregate or include the field in the GROUP BY
here).
I found it in the SQL 2011 spec...
If the <select list>
“*” is simply contained in a <table subquery>
that is immediately contained in an <exists predicate>
, then the <select list>
is equivalent to a <value expression>
that is an arbitrary <literal>
.
This confirms that by *
not being equivalent to an arbitrary literal in this context that it is in fact PostgreSQL breaking the spec.
Keep in mind this is a distinct problem from
SELECT *
FROM ( VALUES (1),(2),(3) ) AS t(x)
HAVING count(*) > 1
Which both databases reject.
PostgreSQL,
ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function
SQL Server,
Column 't.x' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Why this bug persists in PostgreSQL
Thanks goes to RhodiumToad on irc.freenode.net/#PostgreSQL for his help trouble shooting this. He also points out the difficulty in resolving this situation
20:33 < RhodiumToad> the one problem is that in pg you can do exists(select func() from ... where the func() is an SRF that might return 0 rows
An SRF is a set returning function.
In PostgreSQL, we can do for instance use an SRF to generate a series from 1-10 (generate_series
is in core)
SELECT * FROM generate_series(1,10);
And, we likewise can put it right here.
SELECT generate_series(1,10);
Two of them together give us a cross-join (cartesian product)
SELECT generate_series(1,10), generate_series(1,2);
But, if either of those return 0-rows you get nothing.. Effectually the same as this
SELECT * FROM ( VALUES (1) ) AS t(x)
CROSS JOIN ( SELECT 1 LIMIT 0 ) AS g;
And, that's the problem with optimizing this out entirely. You can have an SRF in a select-list inside of an EXIST statement that returns 0-rows, and forces the EXISTS to evaluate to false.
Best Answer
This is actually a really bad thing to do IMHO, and it's not supported in most other database platforms.
The reasons people do it:
The reasons it's bad:
it's not self-documenting - someone is going to have to go parse the SELECT list to figure out the grouping. It would actually be a little more clear in SQL Server, which doesn't support cowboy who-knows-what-will-happen grouping like MySQL does.
it's brittle - someone comes in and changes the SELECT list because the business users wanted a different report output, and now your output is a mess. If you had used column names in the GROUP BY, order in the SELECT list would be irrelevant.
SQL Server supports ORDER BY [ordinal]; here are some parallel arguments against its use: