MySQL performance, UNION then JOIN, or JOIN on each part of the UNION

MySQL

Which does the optimiser handle better?

SELECT TBL.*,J.* FROM (
    (SELECT ...,T.ThingId FROM ...)
    UNION
    (SELECT ...,T.ThingId FROM ...)
) TBL
LEFT JOIN ThingDetails TD USING(ThingId)
....

(SELECT ...,T.ThingId,TD.* FROM ... LEFT JOIN ThingDetails TD USING(ThingId))
UNION
(SELECT ...,T.ThingId,TD.* FROM ... LEFT JOIN ThingDetails TD USING(ThingId))

I would imagine in simple cases that it doesn't matter much, but I wondered what recommendations you guys may have. I don't want to experiment much here because of the time taken to actually write test queries and even then they'll be very specific cases. I'd like to know what the optimiser can apply here.

As an example of an optimisation I know through experience, replacing an OR with a XOR if you know both operands cannot both be true can help!

I'd also be curious to know about WHERE clauses: on each part of the union, or outside? Some testing shows that it may actually matter (an <auto_key0> pops up in the EXPLAIN output if I use a WHERE outside of the union). I shall keep investigating, but a point in the right direction would be great!

Any information on MariaDB or MySQL is welcome. I'm not looking for specific behaviour in each version; going back at least as far as 5.1, MySQL has been able to factor various constraints in and out. I'm looking for detailed optimiser behaviour.

My own limited testing showed that MySQL factored in some joins and factored out others. In fact, in one of my tests MySQL returned the first part of the union in the order of the primary key index of the table it was joining to. The as-if rule states that any optimisation is okay as long as, to the observer, it is as if what they wrote actually happened. I've had problems with this before. It was actually a bug in MySQL 5.1; I now know it's because of how InnoDB works.

This question is really about what 'confuses' the optimiser enough to not be able to make these observations.

Best Answer

It depends. Let's dissect each scenario.

( SELECT UNION SELECT ) JOIN

breaks down into

create tmp table
do first select into that tmp (n1 rows)
do second select into that tmp (n2 rows)
dedup because plain UNION means UNION DISTINCT -- change to UNION ALL to skip this step (n3=n1+n2 with ALL, or fewer rows with DISTINCT)
join n3 times

And...

( SELECT JOIN ) UNION ( SELECT JOIN )

breaks down into

create tmp table
do first select and join into that tmp (n1 rows & n1 joins)
do second select and join into that tmp (n2 rows & n2 joins)
dedup because plain UNION means UNION DISTINCT -- change to UNION ALL to skip this step (n3=n1+n2 with ALL, or fewer rows with DISTINCT)

So it depends on which case you have.

If the JOIN filters out some rows, then whatever follows it will have less work.
If you keep UNION DISTINCT, the dedup can shrink the subsequent work; if duplicates cannot occur, UNION ALL avoids the dedup pass altogether.
If the JOIN adds lots of bulky columns, the join-first case is building a bigger tmp table.
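One way to see which shape the optimiser prefers on your own data is to run EXPLAIN on both forms. The following is only a sketch: ThingsA and ThingsB are hypothetical tables standing in for the elided FROM clauses in the question, and each is assumed to have a ThingId column.

```sql
-- ThingsA and ThingsB are hypothetical stand-ins for the elided FROM clauses.

-- Form 1: union first, then a single join against the combined rows.
EXPLAIN
SELECT TBL.*, TD.*
FROM (
    (SELECT A.ThingId FROM ThingsA A)
    UNION
    (SELECT B.ThingId FROM ThingsB B)
) TBL
LEFT JOIN ThingDetails TD USING (ThingId);

-- Form 2: join inside each branch, then union the already-joined rows.
EXPLAIN
(SELECT A.ThingId, TD.* FROM ThingsA A LEFT JOIN ThingDetails TD USING (ThingId))
UNION
(SELECT B.ThingId, TD.* FROM ThingsB B LEFT JOIN ThingDetails TD USING (ThingId));
```

In the output, compare the estimated row counts per step and look for DERIVED and UNION RESULT rows, which indicate where temporary tables are being built.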

Watch out for <auto_key0> -- it means you have a tmp table without an index, but 5.6 was smart enough to create one for you. Your SELECT, as written, should not need that unless ThingDetails does not have an index on ThingId. If that is the case, then add an index rather than relying on the auto key.
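Checking for the index and adding it if it is missing might look like the sketch below. The index name idx_thingid is purely illustrative; the table and column names are taken from the question.

```sql
-- List the existing indexes; look for one whose first column is ThingId.
SHOW INDEX FROM ThingDetails;

-- If there is none, add one. The name idx_thingid is an arbitrary choice.
ALTER TABLE ThingDetails ADD INDEX idx_thingid (ThingId);
```

After adding the index, re-run EXPLAIN on the original query to see whether the join to ThingDetails now uses it.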

Related Solutions

Mysql – How does MySQL coerce types during joins

You need to make absolutely sure the character sets of the tables are identical. Keep in mind that a Unicode issue could be at play here.

You could redo the query like this:

select FOOBAAR_hhs.relationship_group from FOOBAAR left join FOOBAAR_hhs
on FOOBAAR.relationship_group = FOOBAAR_hhs.relationship_group where new_hh_id = 15387929;

select FOOBAAR_hhs.relationship_group from FOOBAAR left join FOOBAAR_hhs
USING (relationship_group) where new_hh_id = 15387929;

Run EXPLAIN on each of these and see whether the plan changes.
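To compare the character sets mentioned at the start of this answer, one option is to query information_schema for the join column on both tables. This is a sketch that assumes both tables live in the currently selected database:

```sql
-- Compare type, character set and collation of the join column on both sides.
-- Assumes FOOBAAR and FOOBAAR_hhs are in the current default database.
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_SET_NAME, COLLATION_NAME
FROM   information_schema.COLUMNS
WHERE  TABLE_SCHEMA = DATABASE()
  AND  TABLE_NAME IN ('FOOBAAR', 'FOOBAAR_hhs')
  AND  COLUMN_NAME = 'relationship_group';
```

If the two rows differ in DATA_TYPE, CHARACTER_SET_NAME or COLLATION_NAME, the join has to convert one side, which can prevent an index on that column from being used.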

Mysql – What corner cases exist when relying on undocumented behaviour to determine values selected by MySQL for hidden columns in GROUP BY operations

I was thinking about the NATURAL JOIN example you just used:

SELECT * FROM my_table NATURAL JOIN (
  SELECT   group_col, MAX(sort_col) sort_col
  FROM     my_table
  GROUP BY group_col
) t

If you shift to another type of JOIN and impose WHERE, ordering can come and go without warning in spite of the ill-advised reliance on undocumented behavior of the GROUP BY.

For this example, I will

use Windows 7
use MySQL 5.5.12-log for Windows
create some sample data
impose a LEFT JOIN without a WHERE clause
impose a LEFT JOIN with a WHERE clause

For the DB Environment

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)

mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win64                        |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)

mysql>

Using this script to generate sample data

DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
    id int not null auto_increment,
    num int,
    primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;

and these two queries to test what happens when the GROUP BY result is used in a subsequent join:

SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;

Let's test the durability of the GROUP BY's results:

STEP 01 : Create the Sample Data

mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)

mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)

mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
    -> (
    ->     id int not null auto_increment,
    ->     num int,
    ->     primary key (id)
    -> );
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO groupby (num) VALUES
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM groupby;
+----+------------+
| id | num        |
+----+------------+
|  1 |  269529129 |
|  2 |  387090406 |
|  3 | 1126864683 |
|  4 |  411160755 |
|  5 |   29173595 |
|  6 |  266349579 |
|  7 | 1244227156 |
|  8 |    6231766 |
|  9 |  269529129 |
| 10 |  387090406 |
| 11 | 1126864683 |
| 12 |  411160755 |
| 13 |   29173595 |
| 14 |  266349579 |
| 15 | 1244227156 |
| 16 |    6231766 |
+----+------------+
16 rows in set (0.00 sec)

STEP 02 : Use `LEFT JOIN` without a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id);
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
|  1 |  269529129 |       NULL |
|  2 |  387090406 |       NULL |
|  3 | 1126864683 |       NULL |
|  4 |  411160755 |       NULL |
|  5 |   29173595 |       NULL |
|  6 |  266349579 |       NULL |
|  7 | 1244227156 |       NULL |
|  8 |    6231766 |       NULL |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 |  411160755 |  411160755 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 |    6231766 |    6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)

mysql>

STEP 03 : Use `LEFT JOIN` with a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
| 16 |    6231766 |    6231766 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 12 |  411160755 |  411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)

mysql>

ANALYSIS

Looking at the aforementioned results, here are two questions:

Why does a LEFT JOIN keep an ordering by id ?
Why in the world did using a WHERE impose a reordering ?
- Was it during the JOIN phase ?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it ?

No one foresaw any of these effects because the results of the explicit clauses depended on implicit, undocumented behavior of the Query Optimizer.
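A reasonable first step toward answering these questions is to compare the two plans. One plausible reading, which the output above does not confirm, is that WHERE B.num IS NOT NULL rejects every NULL-extended row, so the optimizer may treat the LEFT JOIN as an inner join and drive from the derived table B, whose rows come back in num order. The sketch below simply runs EXPLAIN on the two test queries:

```sql
-- Plan for the LEFT JOIN without a WHERE clause.
EXPLAIN
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id);

-- Plan for the LEFT JOIN with a WHERE clause that rejects NULL-extended rows.
EXPLAIN
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;
```

If the order in which A and the derived table are listed differs between the two plans, that would account for the different row order.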

CONCLUSION

From my perspective, corner cases arise from factors outside the query text itself. In light of this, developers must be willing to fully evaluate the results of a GROUP BY in conjunction with the following twelve (12) aspects:

aggregate functions
subquery usage
JOIN clauses
WHERE clauses
sort order of results with no explicit ORDER BY clause
query results using older GA releases of MySQL
query results using newer beta releases of MySQL
the current SQL_MODE setting in my.cnf
the operating system the code was compiled for
possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
possibly the storage engine being used (MyISAM vs InnoDB)
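Several of these aspects can be recorded alongside a test result so that a later change in behavior can be traced to a change in environment. A minimal sketch, using the eggyal.groupby table from the example above:

```sql
-- Server version and build platform.
SELECT VERSION();
SHOW VARIABLES LIKE 'version_compile%';

-- SQL mode in effect, both server-wide and for this session.
SELECT @@GLOBAL.sql_mode AS global_sql_mode, @@SESSION.sql_mode AS session_sql_mode;

-- Buffer sizes that may influence the chosen plan.
SHOW VARIABLES LIKE 'join_buffer_size';
SHOW VARIABLES LIKE 'sort_buffer_size';

-- Storage engine of the table under test.
SELECT TABLE_NAME, ENGINE
FROM   information_schema.TABLES
WHERE  TABLE_SCHEMA = 'eggyal' AND TABLE_NAME = 'groupby';
```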

Here is the key thing to remember : Any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve(12) evaluation aspects, the corner case is due to break, especially given the first nine(9) aspects.
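The one aspect on the list that can be taken out of the optimizer's hands is row order: SQL only guarantees an order when the outermost query has an explicit ORDER BY. As a sketch, the STEP 03 query with a stated order looks like this:

```sql
-- Same query as STEP 03, but the row order no longer depends on the plan.
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id)
WHERE B.num IS NOT NULL
ORDER BY A.id;
```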