MySQL – GROUP BY clause without aggregate function

aggregate, mysql, postgresql

I've always assumed GROUP BY was designed specifically for aggregate functions and in all other circumstances you should use ORDER BY. For example, we have three tables: orders, shippers, and employees:

OrderID  CustomerID  EmployeeID  OrderDate   ShipperID
10248    90          5           1996-07-04  3
10249    81          6           1996-07-05  1
10250    34          4           1996-07-08  2

ShipperID  ShipperName       Phone
1          Speedy Express    (503) 555-9831
2          United Package    (503) 555-3199
3          Federal Shipping  (503) 555-9931

EmployeeID  LastName   FirstName  BirthDate   Photo       Notes
1           Davolio    Nancy      1968-12-08  EmpID1.pic  Education includes a BA....
2           Fuller     Andrew     1952-02-19  EmpID2.pic  Andrew received his BTS....
3           Leverling  Janet      1963-08-30  EmpID3.pic  Janet has a BS degree...

We can then determine the number of orders sent by a shipper:

SELECT Shippers.ShipperName,COUNT(Orders.OrderID) AS NumberOfOrders FROM Orders
LEFT JOIN Shippers
ON Orders.ShipperID=Shippers.ShipperID
GROUP BY ShipperName;

GROUP BY helps here because it prevents duplicate shipper names in the result set. That is, as I understand it, the aggregate function without GROUP BY would return two rows for a shipper name that appears more than once in the shippers table. GROUP BY gives us the aggregate without duplicates.
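A minimal sketch of the shipper query above, run against an in-memory SQLite database from Python so that it is self-contained. SQLite is only a stand-in here for MySQL or PostgreSQL. The first three orders come from the sample table; order 10251 is a made-up row added so that one shipper has more than one order. The ORDER BY is also an addition, used only to make the row order predictable: GROUP BY decides which rows are collapsed together, while ORDER BY only sorts whatever rows come out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, ShipperName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ShipperID INTEGER);
    INSERT INTO Shippers VALUES
        (1, 'Speedy Express'), (2, 'United Package'), (3, 'Federal Shipping');
    -- 10251 is a hypothetical extra order, not part of the sample data
    INSERT INTO Orders VALUES (10248, 3), (10249, 1), (10250, 2), (10251, 3);
""")

# One output row per shipper name, each with its own count.
per_shipper = conn.execute("""
    SELECT Shippers.ShipperName, COUNT(Orders.OrderID) AS NumberOfOrders
    FROM Orders
    LEFT JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
    GROUP BY ShipperName
    ORDER BY ShipperName
""").fetchall()

# The same aggregate with no GROUP BY: all rows form a single group.
total = conn.execute("SELECT COUNT(OrderID) FROM Orders").fetchall()

print(per_shipper)
print(total)
```

With this data the grouped query yields one row per shipper, while the aggregate on its own yields a single row holding the overall count.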

Makes sense. But then I came across this query generated by an ORM (ActiveRecord in Rails, in this case):

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

There are no aggregate functions in that SQL statement. Shouldn't it be using ORDER BY instead?

Best Answer

This:

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

Finds all users who have a timesheet submitted on or before the given date. It's equivalent to:

SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');

or:

SELECT  users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets 
    WHERE timesheets.user_id = users.id
    AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);
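The equivalence of these three forms can be checked with a small sketch. It uses Python's sqlite3 module with an in-memory database purely as a convenient stand-in; the table contents (alice, bob, carol and their timesheets) are invented for illustration, and SQLite's rules for non-aggregated columns are looser than PostgreSQL's, so this shows the result sets rather than what any particular server will accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE timesheets (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        submitted_at TEXT
    );
    -- Hypothetical sample rows: alice has two qualifying timesheets,
    -- bob's only timesheet is after the cutoff, carol has one before it.
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO timesheets (user_id, submitted_at) VALUES
        (1, '2010-07-01 09:00:00'),
        (1, '2010-07-05 09:00:00'),
        (2, '2010-07-10 09:00:00'),
        (3, '2010-07-06 10:00:00');
""")

CUTOFF = "2010-07-06 15:27:05.117700"
JOIN = """
    FROM users
    INNER JOIN timesheets ON timesheets.user_id = users.id
    WHERE timesheets.submitted_at <= ?
"""

def run(sql):
    # Sort in Python so the comparison does not depend on row order.
    return sorted(conn.execute(sql, (CUTOFF,)).fetchall())

plain_join = run("SELECT users.*" + JOIN)            # one row per matching timesheet
grouped = run("SELECT users.*" + JOIN + "GROUP BY users.id")
distinct = run("SELECT DISTINCT users.*" + JOIN)
exists = run("""
    SELECT users.* FROM users
    WHERE EXISTS (
        SELECT 1 FROM timesheets
        WHERE timesheets.user_id = users.id
          AND timesheets.submitted_at <= ?
    )
""")

print(plain_join)
print(grouped == distinct == exists)
```

The plain join repeats alice once per matching timesheet; the GROUP BY, DISTINCT and EXISTS variants each return every matching user exactly once.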

It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row because there can only be one candidate users.name or whatever for any given users.id row.

(Older PostgreSQL versions, before 9.1, didn't know how to identify functional dependencies on the primary key and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here.)

Related Solutions

SQL Server – Why is an aggregate query significantly faster with a GROUP BY clause than without one

It looks like it is probably following an index on CreatedDate in order from lowest to highest and doing lookups to evaluate the SomeIndexedValue = 1 predicate.

When it finds the first matching row it is done, but it may well be doing many more lookups than it expects before it finds such a row (it assumes the rows matching the predicate are randomly distributed according to date.)

See my answer here for a similar issue

The ideal index for this query would be one on (SomeIndexedValue, CreatedDate). Assuming that you can't add that, or at least make your existing index on SomeIndexedValue cover CreatedDate as an included column, you could try rewriting the query as follows:

SELECT MIN(DATEADD(DAY, 0, CreatedDate)) AS CreatedDate
FROM MyTable
WHERE SomeIndexedValue = 1

to prevent it from using that particular plan.

PostgreSQL – Getting around the constraint “column must appear in the GROUP BY clause or be used in an aggregate function”

It should be like this:

SELECT person.name, person.id, license.expiry_date, COUNT(car) FROM person
  JOIN license ON license.person_id = person.id
  JOIN car ON car.owner_id = person.id
WHERE person.name = 'Charles Bannerman'
GROUP BY person.name, person.id, license.expiry_date, car.car;
