Mysql – Why do we use Group by 1 and Group by 1,2,3 in SQL query

group byMySQLplsqlsyntax

In SQL queries, we do use Group by clause to apply aggregate functions.

But what is the purpose behind using numeric value instead of column
name with Group by clause? For example: Group by 1.

Best Answer

This is actually a really bad thing to do IMHO, and it's not supported in most other database platforms.

The reasons people do it:

they're lazy - I don't know why people think their productivity is improved by writing terse code rather than typing for an extra 40 milliseconds to get much more literal code.

The reasons it's bad:

it's not self-documenting - someone is going to have to go parse the SELECT list to figure out the grouping. It would actually be a little more clear in SQL Server, which doesn't support cowboy who-knows-what-will-happen grouping like MySQL does.
it's brittle - someone comes in and changes the SELECT list because the business users wanted a different report output, and now your output is a mess. If you had used column names in the GROUP BY, order in the SELECT list would be irrelevant.

SQL Server supports ORDER BY [ordinal]; here are some parallel arguments against its use:

https://sqlblog.org/2009/10/06/bad-habits-to-kick-order-by-ordinal

Related Solutions

Mysql – group by clause without aggregate function

This:

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

Finds all users who have a timesheet submitted on or before the given date. It's equivalent to:

SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');

or:

SELECT  users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets 
    WHERE timesheets.user_id = users.id
    AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);

It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row because there can only be one candidate users.name or whatever for any given users.id row.

(Older PostgreSQL versions didn't know how to identify functional dependencies of the primary key and and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here).

SQL Spec – Does EXISTS() Require a GROUP BY?

I found it in the SQL 2011 spec...

If the <select list> “*” is simply contained in a <table subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.

This confirms that by * not being equivalent to an arbitrary literal in this context that it is in fact PostgreSQL breaking the spec.

Keep in mind this is a distinct problem from

SELECT *
FROM ( VALUES (1),(2),(3) ) AS t(x)
HAVING count(*) > 1

Which both databases reject.

PostgreSQL,

ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function

SQL Server,

Column 't.x' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Why this bug persists in PostgreSQL

Thanks goes to RhodiumToad on irc.freenode.net/#PostgreSQL for his help trouble shooting this. He also points out the difficulty in resolving this situation

20:33 < RhodiumToad> the one problem is that in pg you can do exists(select func() from ... where the func() is an SRF that might return 0 rows

An SRF is a set returning function.

In PostgreSQL, we can do for instance use an SRF to generate a series from 1-10 (generate_series is in core)

SELECT * FROM generate_series(1,10);

And, we likewise can put it right here.

SELECT generate_series(1,10);

Two of them together give us a cross-join (cartesian product)

SELECT generate_series(1,10), generate_series(1,2);

But, if either of those return 0-rows you get nothing.. Effectually the same as this

SELECT * FROM ( VALUES (1) ) AS t(x)
CROSS JOIN ( SELECT 1 LIMIT 0 ) AS g;

And, that's the problem with optimizing this out entirely. You can have an SRF in a select-list inside of an EXIST statement that returns 0-rows, and forces the EXISTS to evaluate to false.

Best Answer

Related Solutions

Mysql – group by clause without aggregate function

SQL Spec – Does EXISTS() Require a GROUP BY?

Why this bug persists in PostgreSQL

Related Question