I've always assumed GROUP BY was designed specifically for aggregate functions, and that in all other circumstances you should use ORDER BY. For example, we have three tables: orders, shippers, and employees:
OrderID  CustomerID  EmployeeID  OrderDate   ShipperID
10248    90          5           1996-07-04  3
10249    81          6           1996-07-05  1
10250    34          4           1996-07-08  2

ShipperID  ShipperName       Phone
1          Speedy Express    (503) 555-9831
2          United Package    (503) 555-3199
3          Federal Shipping  (503) 555-9931

EmployeeID  LastName   FirstName  BirthDate   Photo       Notes
1           Davolio    Nancy      1968-12-08  EmpID1.pic  Education includes a BA....
2           Fuller     Andrew     1952-02-19  EmpID2.pic  Andrew received his BTS....
3           Leverling  Janet      1963-08-30  EmpID3.pic  Janet has a BS degree...
We can then determine the number of orders sent by each shipper:
SELECT Shippers.ShipperName, COUNT(Orders.OrderID) AS NumberOfOrders
FROM Orders
LEFT JOIN Shippers
ON Orders.ShipperID = Shippers.ShipperID
GROUP BY ShipperName;
GROUP BY helps here because it prevents duplicate shipper names in the result set. That is, the aggregate function on its own, without GROUP BY, would return two rows for a shipper name if that name appears more than once in the shippers table. GROUP BY gives us the aggregate without duplicates.
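To make that concrete, here is a minimal sketch that runs the query above against an in-memory SQLite database from Python. The use of SQLite, the trimmed-down table definitions, and the extra order 10251 are my own assumptions for illustration; that extra row is not part of the sample data and is only there so that one shipper has more than one order.

```python
import sqlite3

# In-memory SQLite database as a stand-in; only the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, ShipperName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ShipperID INTEGER);
    INSERT INTO Shippers VALUES
        (1, 'Speedy Express'), (2, 'United Package'), (3, 'Federal Shipping');
    -- The first three orders come from the sample data above.
    -- Order 10251 is hypothetical, added so one shipper has two orders.
    INSERT INTO Orders VALUES (10248, 3), (10249, 1), (10250, 2), (10251, 1);
""")

# Same query as above: one result row per shipper name, with its order count.
rows = conn.execute("""
    SELECT Shippers.ShipperName, COUNT(Orders.OrderID) AS NumberOfOrders
    FROM Orders
    LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
    GROUP BY ShipperName
""").fetchall()

# Row order is unspecified without ORDER BY, so collect into a dict and sort.
counts = dict(rows)
for name, number_of_orders in sorted(counts.items()):
    print(name, number_of_orders)
```

With the hypothetical extra order, Speedy Express shows up once with a count of 2, and the other two shippers show up once each with a count of 1.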
Makes sense. But then I came across this query generated by an ORM (ActiveRecord in Rails in this case):
SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id
There are no aggregate functions in that SQL statement. Shouldn't it be using ORDER BY instead?
Best Answer
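The query in the question finds all users who have a timesheet submitted on or before the given date. It's equivalent to selecting from users alone and filtering with a subquery on timesheets, for example with EXISTS or IN.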
It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row, because there can only be one candidate users.name (or whatever) for any given users.id row.
(Older PostgreSQL versions, before 9.1, didn't know how to identify functional dependencies of the primary key and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here.)
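As a rough check of the behaviour described here, the sketch below compares the GROUP BY users.id query with ORDER BY and with two subquery formulations (EXISTS and IN). Several things are assumptions on my part: the column layout of users and timesheets beyond the names used in the question, the sample rows, and the use of an in-memory SQLite database instead of PostgreSQL. SQLite accepts non-aggregated columns alongside GROUP BY in any case, so this shows that the queries return the same rows, not PostgreSQL's functional-dependency check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and rows, for illustration only.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE timesheets (
        id INTEGER PRIMARY KEY, user_id INTEGER, submitted_at TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO timesheets (user_id, submitted_at) VALUES
        (1, '2010-07-01 09:00:00'),   -- alice, before the cutoff
        (1, '2010-07-05 09:00:00'),   -- alice again, before the cutoff
        (2, '2010-07-10 09:00:00');   -- bob, after the cutoff
""")

CUTOFF = "2010-07-06 15:27:05.117700"
JOIN = ("FROM users INNER JOIN timesheets ON timesheets.user_id = users.id "
        "WHERE timesheets.submitted_at <= ?")


def run(sql):
    # Sort the rows so results can be compared regardless of row order.
    return sorted(conn.execute(sql, (CUTOFF,)).fetchall())


# ORDER BY only sorts: the join still yields one row per matching timesheet.
ordered = run(f"SELECT users.* {JOIN} ORDER BY users.id")
# GROUP BY users.id collapses those rows to one per user.
grouped = run(f"SELECT users.* {JOIN} GROUP BY users.id")
# Two subquery formulations that avoid the join altogether.
exists = run("SELECT users.* FROM users WHERE EXISTS ("
             "SELECT 1 FROM timesheets WHERE timesheets.user_id = users.id "
             "AND timesheets.submitted_at <= ?)")
in_sub = run("SELECT users.* FROM users WHERE users.id IN ("
             "SELECT user_id FROM timesheets WHERE submitted_at <= ?)")

print(ordered)  # [(1, 'alice'), (1, 'alice')]
print(grouped)  # [(1, 'alice')]
print(grouped == exists == in_sub)  # True
```

So the GROUP BY is not there for sorting. The join produces one row per matching timesheet, and grouping by the primary key removes the repeats, which ORDER BY would leave in place.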