Sql-server – EXISTS (SELECT 1 …) vs EXISTS (SELECT * …) One or the other

MySQLoraclepostgresqlsql server

Whenever I need to check for the existence of some row in a table, I tend to write always a condition like:

SELECT a, b, c
  FROM a_table
 WHERE EXISTS
       (SELECT *  -- This is what I normally write
          FROM another_table
         WHERE another_table.b = a_table.b
       )

Some other people write it like:

SELECT a, b, c
  FROM a_table
 WHERE EXISTS
       (SELECT 1   --- This nice '1' is what I have seen other people use
          FROM another_table
         WHERE another_table.b = a_table.b
       )

When the condition is NOT EXISTS instead of EXISTS: In some occasions, I might write it with a LEFT JOIN and an extra condition (sometimes called an antijoin):

SELECT a, b, c
  FROM a_table
       LEFT JOIN another_table ON another_table.b = a_table.b
 WHERE another_table.primary_key IS NULL

I try to avoid it because I think the meaning is less clear, specially when what is your primary_key is not that obvious, or when your primary key or your join condition is multi-column (and you can easily forget one of the columns). However, sometimes you maintain code written by somebody else… and it is just there.

Is there any difference (other than style) to use SELECT 1 instead of SELECT *?
Is there any corner case where it does not behave the same way?
Although what I wrote is (AFAIK) standard SQL: Is there such a difference for different databases / older versions?
Is there any advantage on explicity writing an antijoin?
Do contemporary planners/optimizers treat it differently from the NOT EXISTS clause?

Best Answer

No, there is no difference in efficiency between (NOT) EXISTS (SELECT 1 ...) and (NOT) EXISTS (SELECT * ...) in all major DBMS. I've often seen (NOT) EXISTS (SELECT NULL ...) being used as well.

In some you can even write (NOT) EXISTS (SELECT 1/0 ...) and the result is the same - without any (division by zero) error, which proves that the expression there is not even evaluated.

About the LEFT JOIN / IS NULL antijoin method, a correction: this is equivalent to NOT EXISTS (SELECT ...).

In this case, NOT EXISTS vs LEFT JOIN / IS NULL, you may get different execution plans. In MySQL for example and mostly in older versions (before 5.7) the plans would be fairly similar but not identical. The optimizers of other DBMS (SQL Server, Oracle, Postgres, DB2) are - as far as I know - more or less capable of rewriting these 2 methods and considering the same plans for both. Still, there is no such guarantee and when doing optimization, it is good to check the plans from different equivalent rewrites as there could be cases that each optimizer doesn't rewrite (eg. complex queries, with many joins and/or derived tables / subqueries inside the subquery, where conditions from multiple tables, composite columns used in the joining conditions) or the optimizer choices and plans are affected differently by the available indexes, settings, etc.

Also note that USING cannot be used in all DBMS (SQL Server for example). The more common JOIN ... ON works everywhere.
And the columns needs to be prefixed with the table name/alias in the SELECT to avoid errors/ambiguities when we have joins.
I also usually prefer to put the joined column in the IS NULL check (although the PK or any non-nullable column would be OK, it might be useful for efficiency when the plan for LEFT JOIN uses a non-clustered index):

SELECT a_table.a, a_table.b, a_table.c
  FROM a_table
       LEFT JOIN another_table 
           ON another_table.b = a_table.b
 WHERE another_table.b IS NULL ;

There is also a third method for antijoins, using NOT IN but this has different semantics (and results!) if the column of the inside table is nullable. It can be used though by excluding the rows with NULL, making the query equivalent to the previous 2 versions:

SELECT a, b, c
  FROM a_table
 WHERE a_table.b NOT IN 
       (SELECT another_table.b
          FROM another_table
         WHERE another_table.b IS NOT NULL
       ) ;

This also usually yields similar plans in most DBMS.

Related Solutions

Mysql – What corner cases exist when relying on undocumented behaviour to determine values selected by MySQL for hidden columns in GROUP BY operations

I was thinking the NATURAL JOIN example you just used

SELECT * FROM my_table NATURAL JOIN (
  SELECT   group_col, MAX(sort_col) sort_col
  FROM     my_table
  GROUP BY group_col
) t

If you shift to another type of JOIN and impose WHERE, ordering can come and go without warning in spite of the ill-advised reliance on undocumented behavior of the GROUP BY.

For this example, I will

use Windows 7
use MySQL 5.5.12-log for Windows
create some sample data
impose a LEFT JOIN without a WHERE clause
impose a LEFT JOIN with a WHERE clause

For the DB Environment

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)

mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win64                        |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)

mysql>

Using this script to generate sample data

DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
    id int not null auto_increment,
    num int,
    primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;

and these two queries for testing the GROUP BY subsequent use;

SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;

Let's test the durability of the GROUP BY's results;

STEP 01 : Create the Sample Data

mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)

mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)

mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
    -> (
    ->     id int not null auto_increment,
    ->     num int,
    ->     primary key (id)
    -> );
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO groupby (num) VALUES
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM groupby;
+----+------------+
| id | num        |
+----+------------+
|  1 |  269529129 |
|  2 |  387090406 |
|  3 | 1126864683 |
|  4 |  411160755 |
|  5 |   29173595 |
|  6 |  266349579 |
|  7 | 1244227156 |
|  8 |    6231766 |
|  9 |  269529129 |
| 10 |  387090406 |
| 11 | 1126864683 |
| 12 |  411160755 |
| 13 |   29173595 |
| 14 |  266349579 |
| 15 | 1244227156 |
| 16 |    6231766 |
+----+------------+
16 rows in set (0.00 sec)

STEP 02 : Use `LEFT JOIN` without a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id);
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
|  1 |  269529129 |       NULL |
|  2 |  387090406 |       NULL |
|  3 | 1126864683 |       NULL |
|  4 |  411160755 |       NULL |
|  5 |   29173595 |       NULL |
|  6 |  266349579 |       NULL |
|  7 | 1244227156 |       NULL |
|  8 |    6231766 |       NULL |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 |  411160755 |  411160755 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 |    6231766 |    6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)

mysql>

STEP 03 : Use `LEFT JOIN` with a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
| 16 |    6231766 |    6231766 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 12 |  411160755 |  411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)

mysql>

ANALYSIS

Looking at the aforementioned results, here are two questions:

Why does a LEFT JOIN keep an ordering by id ?
Why in the world did using a WHERE impose a reordering ?
- Was it during the JOIN phase ?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it ?

No one foresaw any of these effects because the behavior of explicit clauses was relied upon by the implicit behavior of the Query Optimizer.

CONCLUSION

From my perspective, corner cases can only be of an external nature. In light of this, developers must be willing to fully evaluate the results of a GROUP BY in conjunction with the following twelve(12) aspects:

aggregate functions
subquery usage
JOINs clauses
WHERE clauses
sort order of results with no explicit ORDER BY clause
query results using older GA releases of MySQL
query results using newer beta releases of MySQL
the current SQL_MODE setting in my.cnf
the operating system the code was compiled for
possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
possibly the storage engine being used (MyISAM vs InnoDB)

Here is the key thing to remember : Any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve(12) evaluation aspects, the corner case is due to break, especially given the first nine(9) aspects.

Mysql – LEFT JOIN DEFAULT NULL VALUE

If I understand correctly, you could have any number of columns from table B? If so, then you will need to dynamically create the SQL statement (by querying the information_schema table) and just wrap each column from B in your ifnull(field, ifnullvalue).