The problem is that your 2nd query:
SELECT datasets.id
FROM datasets
GROUP BY datasets.id
ORDER BY date_added
LIMIT 25 ;
does not mean what you expect. It gives you the first 25 rows ordered by date_added only because id is the primary key of the table, so the GROUP BY can be removed without changing the result.
It seems, however, that the optimizer does not always remove the redundant GROUP BY, and thus it produces a different plan. I'm not sure why; the various optimizer features that perform these simplifications are far from covering all cases.
You might get a better plan if you change the query to have matching GROUP BY and ORDER BY clauses:
SELECT d.id
FROM datasets AS d
GROUP BY d.date_added, d.id
ORDER BY d.date_added, d.id
LIMIT 25 ;
But in any case, my advice would be: don't use redundant or complicated syntax when a simpler one exists.
Now for the 3rd query, with the join: while the GROUP BY method works, you can rewrite it using standard SQL window functions (ROW_NUMBER()), Postgres's DISTINCT ON, or by joining to a derived table (which uses your very first query, with minor details changed):
SELECT
d.id,
array_remove(array_agg(o.some_column), NULL) AS other_table
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN other_table AS o
ON o.id = d.id
GROUP BY d.date_added, d.id
ORDER BY d.date_added
LIMIT 25 ;
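For reference, the ROW_NUMBER() variant mentioned above could be sketched like this (assuming the same hypothetical datasets/other_table schema; note that it aggregates the whole join before limiting, so it may not plan as well as the derived-table version):

```sql
SELECT id, other_table
FROM
  ( SELECT
        d.id,
        array_remove(array_agg(o.some_column), NULL) AS other_table,
        -- window functions run after GROUP BY, so this numbers the grouped rows
        ROW_NUMBER() OVER (ORDER BY d.date_added) AS rn
    FROM datasets AS d
    LEFT JOIN other_table AS o
        ON o.id = d.id
    GROUP BY d.date_added, d.id
  ) AS sub
WHERE rn <= 25
ORDER BY rn ;
```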
We could also avoid GROUP BY completely (well, the aggregation is hidden in the correlated scalar subquery):
SELECT
d.id,
( SELECT array_remove(array_agg(o.some_column), NULL)
FROM other_table AS o
WHERE o.id = d.id
) AS other_table
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25 ;
Both queries are written so that the produced plan performs the (fast) LIMIT subquery first and then the join, avoiding a full scan of either table.
If you need aggregates from more columns, a third method combines both of the above, using a correlated (LATERAL) subquery in the FROM clause:
SELECT
d.id,
o.other_table
-- more aggregates
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN LATERAL
( SELECT array_remove(array_agg(o.some_column), NULL) AS other_table
-- more aggregates
FROM other_table AS o
WHERE o.id = d.id
) AS o
ON TRUE
ORDER BY d.date_added
LIMIT 25 ;
I found it in the SQL 2011 spec...
If the <select list> "*" is simply contained in a <table subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.
This confirms that, because * is not treated as equivalent to an arbitrary literal in this context, it is in fact PostgreSQL that is breaking the spec.
Keep in mind this is a distinct problem from
SELECT *
FROM ( VALUES (1),(2),(3) ) AS t(x)
HAVING count(*) > 1
which both databases reject.
PostgreSQL:
ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function
SQL Server:
Column 't.x' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Why this bug persists in PostgreSQL
Thanks go to RhodiumToad on irc.freenode.net/#PostgreSQL for his help troubleshooting this. He also points out the difficulty in resolving this situation:
20:33 < RhodiumToad> the one problem is that in pg you can do exists(select func() from ... where the func() is an SRF that might return 0 rows
An SRF is a set-returning function. In PostgreSQL, we can, for instance, use an SRF to generate a series from 1 to 10 (generate_series is in core):
SELECT * FROM generate_series(1,10);
Likewise, we can put it directly in the select list:
SELECT generate_series(1,10);
Two of them together give us a cross join (Cartesian product):
SELECT generate_series(1,10), generate_series(1,2);
But if either of them returns zero rows, you get nothing. Effectively the same as this:
SELECT * FROM ( VALUES (1) ) AS t(x)
CROSS JOIN ( SELECT 1 LIMIT 0 ) AS g;
And that's the problem with optimizing this out entirely: an SRF in the select list inside an EXISTS predicate can return zero rows, forcing the EXISTS to evaluate to false.
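A minimal illustration of that behavior (a sketch; generate_series(1, 0) is just a convenient SRF that yields zero rows):

```sql
-- The subquery's FROM clause produces a row, but the SRF in the
-- select list yields zero rows, so the subquery returns no rows
-- and EXISTS comes out false:
SELECT EXISTS (
    SELECT generate_series(1, 0)
    FROM ( VALUES (1) ) AS t(x)
);
```

If * were always replaceable by an arbitrary literal here, the optimizer would wrongly report true for such queries.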
Best Answer
I figured it out. It's the damn parentheses around the columns in the GROUP BY clause. The correct query is: