MySQL – GROUP BY clause without aggregate function

Tags: aggregate, mysql, postgresql

I've always assumed GROUP BY was designed specifically for aggregate functions and in all other circumstances you should use ORDER BY. For example, we have three tables: orders, shippers, and employees:

OrderID  CustomerID  EmployeeID  OrderDate   ShipperID
10248    90          5           1996-07-04  3
10249    81          6           1996-07-05  1
10250    34          4           1996-07-08  2

ShipperID  ShipperName       Phone
1          Speedy Express    (503) 555-9831
2          United Package    (503) 555-3199
3          Federal Shipping  (503) 555-9931


EmployeeID  LastName   FirstName  BirthDate   Photo       Notes
1           Davolio    Nancy      1968-12-08  EmpID1.pic  Education includes a BA....
2           Fuller     Andrew     1952-02-19  EmpID2.pic  Andrew received his BTS....
3           Leverling  Janet      1963-08-30  EmpID3.pic  Janet has a BS degree...

We can then determine the number of orders sent by each shipper:

SELECT Shippers.ShipperName, COUNT(Orders.OrderID) AS NumberOfOrders
FROM Orders
LEFT JOIN Shippers
ON Orders.ShipperID = Shippers.ShipperID
GROUP BY ShipperName;

GROUP BY helps here because it produces one row per shipper name, with COUNT computed separately for each group. Without GROUP BY, the aggregate would run over the whole joined result and collapse it into a single row, and stricter SQL modes (or databases such as PostgreSQL) would reject the query outright because ShipperName is neither aggregated nor grouped. GROUP BY gives us one count per shipper.
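To see what GROUP BY contributes to the shipper query, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL or PostgreSQL so the example runs anywhere, the tables are trimmed to the columns the query needs, and order 10251 is a hypothetical extra row added so that one shipper has more than one order.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL/PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, ShipperName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ShipperID INTEGER);
    INSERT INTO Shippers VALUES
        (1, 'Speedy Express'), (2, 'United Package'), (3, 'Federal Shipping');
    -- The three orders from the question plus one hypothetical order (10251).
    INSERT INTO Orders VALUES (10248, 3), (10249, 1), (10250, 2), (10251, 3);
""")

# Shared FROM/JOIN part of both queries.
join = """
    FROM Orders
    LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
"""

# With GROUP BY: one row per shipper, COUNT computed within each group.
grouped = conn.execute(
    "SELECT Shippers.ShipperName, COUNT(Orders.OrderID) AS NumberOfOrders"
    + join
    + "GROUP BY ShipperName ORDER BY ShipperName"
).fetchall()

# Without GROUP BY: the aggregate collapses all joined rows into one total.
total = conn.execute("SELECT COUNT(Orders.OrderID)" + join).fetchall()

print(grouped)  # [('Federal Shipping', 2), ('Speedy Express', 1), ('United Package', 1)]
print(total)    # [(4,)]
```

The ORDER BY in the grouped query is there only to make the output order predictable; the grouping itself is what yields one row per shipper.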

Makes sense. But then I came across this query generated by an ORM (ActiveRecord in Rails in this case):

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

There are no aggregate functions in that SQL statement. Shouldn't it be using ORDER BY instead?

Best Answer

This:

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

Finds all users who have a timesheet submitted on or before the given date. It's equivalent to:

SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');

or:

SELECT  users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets 
    WHERE timesheets.user_id = users.id
    AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);

It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row because there can only be one candidate users.name or whatever for any given users.id row.

(PostgreSQL versions before 9.1 didn't know how to identify functional dependencies on the primary key and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here.)
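The equivalence of the three queries above, and the reason ORDER BY is not a substitute, can be checked with a small sketch. It again uses Python's sqlite3 module as a stand-in database. The two-column users table and the sample rows are hypothetical, and timestamps are stored as text, which compares correctly here only because they share one fixed format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE timesheets (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        submitted_at TEXT
    );
    -- Hypothetical data: alice has two matching timesheets, bob has one
    -- matching and one that is too late, carol has none.
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO timesheets VALUES
        (1, 1, '2010-07-01 09:00:00'),
        (2, 1, '2010-07-05 09:00:00'),
        (3, 2, '2010-07-02 09:00:00'),
        (4, 2, '2010-08-01 09:00:00');
""")

cutoff = ("2010-07-06 15:27:05.117700",)
join = """
    FROM users
    INNER JOIN timesheets ON timesheets.user_id = users.id
    WHERE timesheets.submitted_at <= ?
"""

# Plain join: one row per matching timesheet, so alice appears twice.
# ORDER BY only sorts these rows; it does not remove the duplicate.
plain = conn.execute(
    "SELECT users.*" + join + "ORDER BY users.id", cutoff
).fetchall()

# GROUP BY on the primary key: one row per user.
grouped = conn.execute(
    "SELECT users.*" + join + "GROUP BY users.id ORDER BY users.id", cutoff
).fetchall()

# DISTINCT over the same join.
distinct = conn.execute(
    "SELECT DISTINCT users.*" + join + "ORDER BY users.id", cutoff
).fetchall()

# EXISTS: no join, so no duplicates to remove in the first place.
exists = conn.execute("""
    SELECT users.* FROM users
    WHERE EXISTS (
        SELECT 1 FROM timesheets
        WHERE timesheets.user_id = users.id
          AND timesheets.submitted_at <= ?
    )
    ORDER BY users.id
""", cutoff).fetchall()

print(plain)    # [(1, 'alice'), (1, 'alice'), (2, 'bob')]
print(grouped)  # [(1, 'alice'), (2, 'bob')]
print(grouped == distinct == exists)  # True
```

Note that SQLite accepts non-grouped columns in a GROUP BY query regardless of functional dependence, so this sketch shows the matching results but not the primary-key check that PostgreSQL performs.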