SQL Spec – Does EXISTS() Require a GROUP BY?

existsgroup bypostgresqlsql serversql-standard

Microsoft currently permits this syntax.

SELECT *
FROM ( VALUES (1) ) AS g(x)
WHERE EXISTS (
  SELECT *
  FROM ( VALUES (1),(1) )
    AS t(x)
  WHERE g.x = t.x
  HAVING count(*) > 1
);

Notice that there is no GROUP BY in the EXISTS clause, is that valid ANSI SQL. Or is it merely exposing an implementation detail.

For reference, this same syntax isn't permitted in PostgreSQL.

ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function

But this syntax is permitted..

SELECT *
FROM ( VALUES (1) ) AS g(x)
WHERE EXISTS (
  SELECT 1  -- This changed from the first query
  FROM ( VALUES (1),(1) )
    AS t(x)
  WHERE g.x = t.x
  HAVING count(*) > 1
);

And this syntax is permitted.

SELECT *
FROM ( VALUES (1) ) AS g(x)
WHERE EXISTS (
  SELECT *
  FROM ( VALUES (1),(1) )
    AS t(x)
  WHERE g.x = t.x
  GROUP BY t.x  -- This changed from the first query
  HAVING count(*) > 1
);

Question arises from a conversation with @ErikE in chat

Best Answer

I found it in the SQL 2011 spec...

If the <select list> “*” is simply contained in a <table subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.

This confirms that by * not being equivalent to an arbitrary literal in this context that it is in fact PostgreSQL breaking the spec.

Keep in mind this is a distinct problem from

SELECT *
FROM ( VALUES (1),(2),(3) ) AS t(x)
HAVING count(*) > 1

Which both databases reject.

PostgreSQL,

ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function

SQL Server,

Column 't.x' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Why this bug persists in PostgreSQL

Thanks goes to RhodiumToad on irc.freenode.net/#PostgreSQL for his help trouble shooting this. He also points out the difficulty in resolving this situation

20:33 < RhodiumToad> the one problem is that in pg you can do exists(select func() from ... where the func() is an SRF that might return 0 rows

An SRF is a set returning function.

In PostgreSQL, we can do for instance use an SRF to generate a series from 1-10 (generate_series is in core)

SELECT * FROM generate_series(1,10); 

And, we likewise can put it right here.

SELECT generate_series(1,10);

Two of them together give us a cross-join (cartesian product)

SELECT generate_series(1,10), generate_series(1,2);

But, if either of those return 0-rows you get nothing.. Effectually the same as this

SELECT * FROM ( VALUES (1) ) AS t(x)
CROSS JOIN ( SELECT 1 LIMIT 0 ) AS g;

And, that's the problem with optimizing this out entirely. You can have an SRF in a select-list inside of an EXIST statement that returns 0-rows, and forces the EXISTS to evaluate to false.