SQL Spec – Does EXISTS() Require a GROUP BY?

existsgroup bypostgresqlsql serversql-standard

Microsoft currently permits this syntax.

SELECT *
FROM ( VALUES (1) ) AS g(x)
WHERE EXISTS (
  SELECT *
  FROM ( VALUES (1),(1) )
    AS t(x)
  WHERE g.x = t.x
  HAVING count(*) > 1
);

Notice that there is no GROUP BY in the EXISTS clause, is that valid ANSI SQL. Or is it merely exposing an implementation detail.

For reference, this same syntax isn't permitted in PostgreSQL.

ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function

But this syntax is permitted..

SELECT *
FROM ( VALUES (1) ) AS g(x)
WHERE EXISTS (
  SELECT 1  -- This changed from the first query
  FROM ( VALUES (1),(1) )
    AS t(x)
  WHERE g.x = t.x
  HAVING count(*) > 1
);

And this syntax is permitted.

SELECT *
FROM ( VALUES (1) ) AS g(x)
WHERE EXISTS (
  SELECT *
  FROM ( VALUES (1),(1) )
    AS t(x)
  WHERE g.x = t.x
  GROUP BY t.x  -- This changed from the first query
  HAVING count(*) > 1
);

Question arises from a conversation with @ErikE in chat

Best Answer

I found it in the SQL 2011 spec...

If the <select list> “*” is simply contained in a <table subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.

This confirms that by * not being equivalent to an arbitrary literal in this context that it is in fact PostgreSQL breaking the spec.

Keep in mind this is a distinct problem from

SELECT *
FROM ( VALUES (1),(2),(3) ) AS t(x)
HAVING count(*) > 1

Which both databases reject.

PostgreSQL,

ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function

SQL Server,

Column 't.x' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Why this bug persists in PostgreSQL

Thanks goes to RhodiumToad on irc.freenode.net/#PostgreSQL for his help trouble shooting this. He also points out the difficulty in resolving this situation

20:33 < RhodiumToad> the one problem is that in pg you can do exists(select func() from ... where the func() is an SRF that might return 0 rows

An SRF is a set returning function.

In PostgreSQL, we can do for instance use an SRF to generate a series from 1-10 (generate_series is in core)

SELECT * FROM generate_series(1,10);

And, we likewise can put it right here.

SELECT generate_series(1,10);

Two of them together give us a cross-join (cartesian product)

SELECT generate_series(1,10), generate_series(1,2);

But, if either of those return 0-rows you get nothing.. Effectually the same as this

SELECT * FROM ( VALUES (1) ) AS t(x)
CROSS JOIN ( SELECT 1 LIMIT 0 ) AS g;

And, that's the problem with optimizing this out entirely. You can have an SRF in a select-list inside of an EXIST statement that returns 0-rows, and forces the EXISTS to evaluate to false.

Related Solutions

SQL:2008 Standard – Is CTE Optimization Fence Behavior Specified?

I think it's an implementation detail.

A conforming implementation is not required to perform the exact sequence of actions defined in the General Rules, provided its effect on SQL-data and schemas, on host parameters and host variable, and on SQL parameters and SQL variables is identical to the effect of that sequence. The term effectively is used to emphasize actions whose effect might be achieved in other ways by an implementation.¹

I think an implementer could evaluate a common table expression 20 times, even in 20 different ways, and still have a conforming implementation. The only relevant issue is whether "its effect . . . is identical to the effect" of the sequence of actions defined in the General rules.

[1]. Section 6.3.3.3, "Rule evaluation order", in a draft of the SQL 2008 standard, having the local filename 5CD2-01-Framework-2006-01.pdf, p. 41 I have no idea where I got it. Google might know.

Mysql – What corner cases exist when relying on undocumented behaviour to determine values selected by MySQL for hidden columns in GROUP BY operations

I was thinking the NATURAL JOIN example you just used

SELECT * FROM my_table NATURAL JOIN (
  SELECT   group_col, MAX(sort_col) sort_col
  FROM     my_table
  GROUP BY group_col
) t

If you shift to another type of JOIN and impose WHERE, ordering can come and go without warning in spite of the ill-advised reliance on undocumented behavior of the GROUP BY.

For this example, I will

use Windows 7
use MySQL 5.5.12-log for Windows
create some sample data
impose a LEFT JOIN without a WHERE clause
impose a LEFT JOIN with a WHERE clause

For the DB Environment

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)

mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win64                        |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)

mysql>

Using this script to generate sample data

DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
    id int not null auto_increment,
    num int,
    primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;

and these two queries for testing the GROUP BY subsequent use;

SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;

Let's test the durability of the GROUP BY's results;

STEP 01 : Create the Sample Data

mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)

mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)

mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
    -> (
    ->     id int not null auto_increment,
    ->     num int,
    ->     primary key (id)
    -> );
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO groupby (num) VALUES
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM groupby;
+----+------------+
| id | num        |
+----+------------+
|  1 |  269529129 |
|  2 |  387090406 |
|  3 | 1126864683 |
|  4 |  411160755 |
|  5 |   29173595 |
|  6 |  266349579 |
|  7 | 1244227156 |
|  8 |    6231766 |
|  9 |  269529129 |
| 10 |  387090406 |
| 11 | 1126864683 |
| 12 |  411160755 |
| 13 |   29173595 |
| 14 |  266349579 |
| 15 | 1244227156 |
| 16 |    6231766 |
+----+------------+
16 rows in set (0.00 sec)

STEP 02 : Use `LEFT JOIN` without a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id);
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
|  1 |  269529129 |       NULL |
|  2 |  387090406 |       NULL |
|  3 | 1126864683 |       NULL |
|  4 |  411160755 |       NULL |
|  5 |   29173595 |       NULL |
|  6 |  266349579 |       NULL |
|  7 | 1244227156 |       NULL |
|  8 |    6231766 |       NULL |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 |  411160755 |  411160755 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 |    6231766 |    6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)

mysql>

STEP 03 : Use `LEFT JOIN` with a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
| 16 |    6231766 |    6231766 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 12 |  411160755 |  411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)

mysql>

ANALYSIS

Looking at the aforementioned results, here are two questions:

Why does a LEFT JOIN keep an ordering by id ?
Why in the world did using a WHERE impose a reordering ?
- Was it during the JOIN phase ?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it ?

No one foresaw any of these effects because the behavior of explicit clauses was relied upon by the implicit behavior of the Query Optimizer.

CONCLUSION

From my perspective, corner cases can only be of an external nature. In light of this, developers must be willing to fully evaluate the results of a GROUP BY in conjunction with the following twelve(12) aspects:

aggregate functions
subquery usage
JOINs clauses
WHERE clauses
sort order of results with no explicit ORDER BY clause
query results using older GA releases of MySQL
query results using newer beta releases of MySQL
the current SQL_MODE setting in my.cnf
the operating system the code was compiled for
possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
possibly the storage engine being used (MyISAM vs InnoDB)

Here is the key thing to remember : Any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve(12) evaluation aspects, the corner case is due to break, especially given the first nine(9) aspects.