I was thinking the NATURAL JOIN
example you just used
SELECT * FROM my_table NATURAL JOIN (
SELECT group_col, MAX(sort_col) sort_col
FROM my_table
GROUP BY group_col
) t
If you shift to another type of JOIN
and impose WHERE
, ordering can come and go without warning in spite of the ill-advised reliance on undocumented behavior of the GROUP BY
.
For this example, I will
- use Windows 7
- use MySQL 5.5.12-log for Windows
- create some sample data
- impose a
LEFT JOIN
without a WHERE
clause
- impose a
LEFT JOIN
with a WHERE clause
For the DB Environment
mysql> select version();
+------------+
| version() |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)
mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name | Value |
+-------------------------+------------------------------+
| version_comment | MySQL Community Server (GPL) |
| version_compile_machine | x86 |
| version_compile_os | Win64 |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)
mysql>
Using this script to generate sample data
DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
id int not null auto_increment,
num int,
primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;
and these two queries for testing the GROUP BY
subsequent use;
SELECT * FROM groupby A LEFT JOIN
(
SELECT num, MAX(id) id
FROM groupby
GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
SELECT num, MAX(id) id
FROM groupby
GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;
Let's test the durability of the GROUP BY
's results;
STEP 01 : Create the Sample Data
mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)
mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)
mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
-> (
-> id int not null auto_increment,
-> num int,
-> primary key (id)
-> );
Query OK, 0 rows affected (0.07 sec)
mysql> INSERT INTO groupby (num) VALUES
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM groupby;
+----+------------+
| id | num |
+----+------------+
| 1 | 269529129 |
| 2 | 387090406 |
| 3 | 1126864683 |
| 4 | 411160755 |
| 5 | 29173595 |
| 6 | 266349579 |
| 7 | 1244227156 |
| 8 | 6231766 |
| 9 | 269529129 |
| 10 | 387090406 |
| 11 | 1126864683 |
| 12 | 411160755 |
| 13 | 29173595 |
| 14 | 266349579 |
| 15 | 1244227156 |
| 16 | 6231766 |
+----+------------+
16 rows in set (0.00 sec)
STEP 02 : Use LEFT JOIN
without a WHERE
clause
mysql> SELECT * FROM groupby A LEFT JOIN
-> (
-> SELECT num, MAX(id) id
-> FROM groupby
-> GROUP BY num
-> ) B USING (id);
+----+------------+------------+
| id | num | num |
+----+------------+------------+
| 1 | 269529129 | NULL |
| 2 | 387090406 | NULL |
| 3 | 1126864683 | NULL |
| 4 | 411160755 | NULL |
| 5 | 29173595 | NULL |
| 6 | 266349579 | NULL |
| 7 | 1244227156 | NULL |
| 8 | 6231766 | NULL |
| 9 | 269529129 | 269529129 |
| 10 | 387090406 | 387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 | 411160755 | 411160755 |
| 13 | 29173595 | 29173595 |
| 14 | 266349579 | 266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 | 6231766 | 6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)
mysql>
STEP 03 : Use LEFT JOIN
with a WHERE
clause
mysql> SELECT * FROM groupby A LEFT JOIN
-> (
-> SELECT num, MAX(id) id
-> FROM groupby
-> GROUP BY num
-> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num | num |
+----+------------+------------+
| 16 | 6231766 | 6231766 |
| 13 | 29173595 | 29173595 |
| 14 | 266349579 | 266349579 |
| 9 | 269529129 | 269529129 |
| 10 | 387090406 | 387090406 |
| 12 | 411160755 | 411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)
mysql>
ANALYSIS
Looking at the aforementioned results, here are two questions:
- Why does a
LEFT JOIN
keep an ordering by id
?
- Why in the world did using a
WHERE
impose a reordering ?
- Was it during the JOIN phase ?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it ?
No one foresaw any of these effects because the behavior of explicit clauses was relied upon by the implicit behavior of the Query Optimizer.
CONCLUSION
From my perspective, corner cases can only be of an external nature. In light of this, developers must be willing to fully evaluate the results of a GROUP BY
in conjunction with the following twelve(12) aspects:
- aggregate functions
- subquery usage
JOINs
clauses
WHERE
clauses
- sort order of results with no explicit
ORDER BY
clause
- query results using older GA releases of MySQL
- query results using newer beta releases of MySQL
- the current SQL_MODE setting in
my.cnf
- the operating system the code was compiled for
- possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
- possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
- possibly the storage engine being used (MyISAM vs InnoDB)
Here is the key thing to remember : Any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve(12) evaluation aspects, the corner case is due to break, especially given the first nine(9) aspects.
Best Answer
No, there is no difference in efficiency between
(NOT) EXISTS (SELECT 1 ...)
and(NOT) EXISTS (SELECT * ...)
in all major DBMS. I've often seen(NOT) EXISTS (SELECT NULL ...)
being used as well.In some you can even write
(NOT) EXISTS (SELECT 1/0 ...)
and the result is the same - without any (division by zero) error, which proves that the expression there is not even evaluated.About the
LEFT JOIN / IS NULL
antijoin method, a correction: this is equivalent toNOT EXISTS (SELECT ...)
.In this case,
NOT EXISTS
vsLEFT JOIN / IS NULL
, you may get different execution plans. In MySQL for example and mostly in older versions (before 5.7) the plans would be fairly similar but not identical. The optimizers of other DBMS (SQL Server, Oracle, Postgres, DB2) are - as far as I know - more or less capable of rewriting these 2 methods and considering the same plans for both. Still, there is no such guarantee and when doing optimization, it is good to check the plans from different equivalent rewrites as there could be cases that each optimizer doesn't rewrite (eg. complex queries, with many joins and/or derived tables / subqueries inside the subquery, where conditions from multiple tables, composite columns used in the joining conditions) or the optimizer choices and plans are affected differently by the available indexes, settings, etc.Also note that
USING
cannot be used in all DBMS (SQL Server for example). The more commonJOIN ... ON
works everywhere.And the columns needs to be prefixed with the table name/alias in the
SELECT
to avoid errors/ambiguities when we have joins.I also usually prefer to put the joined column in the
IS NULL
check (although the PK or any non-nullable column would be OK, it might be useful for efficiency when the plan forLEFT JOIN
uses a non-clustered index):There is also a third method for antijoins, using
NOT IN
but this has different semantics (and results!) if the column of the inside table is nullable. It can be used though by excluding the rows withNULL
, making the query equivalent to the previous 2 versions:This also usually yields similar plans in most DBMS.