MySQL performance, UNION then JOIN, or JOIN on each part of the UNION

MySQL

Which does the optimiser handle better?

SELECT TBL.*,J.* FROM (
    (SELECT ...,T.ThingId FROM ...)
    UNION
    (SELECT ...,T.ThingId FROM ...)
) TBL
LEFT JOIN ThingDetails TD USING(ThingId)
....

Or

(SELECT ...,T.ThingId,TD.* FROM ... LEFT JOIN ThingDetails TD USING(ThingId))
UNION
(SELECT ...,T.ThingId,TD.* FROM ... LEFT JOIN ThingDetails TD USING(ThingId))

I would imagine in simple cases that it doesn't matter much, but I wondered what recommendations you guys may have. I don't want to experiment much here because of the time taken to actually write test queries and even then they'll be very specific cases. I'd like to know what the optimiser can apply here.

As an example of an optimisation I know through experience, replacing an OR with a XOR if you know both operands cannot both be true can help!

I'd also be curious to know about WHERE statements, On each part of the union, or outside? Some testing shows that it actually may matter (an <autokey0> pops up if I use a where outside of union) – I shall keep investigating but a point in the right direction would be great!

Any information on MariaDB or MySQL is welcome. I'm not looking for specific behaviour in each version and going back as far as 5.1 at least MySQL is able to factor in and out various constraints, I'm looking for detailed optimiser behaviour.

My own limited testing showed that MySQL both factored in some joins and factored out others. In fact one of my tests involved MySQL returning the first union in the order of an index of the table it was joining to's primary key. The as-if rule states that any optimisation is okay as long to the observer it is as if what they wrote actually happened. I've had problems with this before. It was actually a bug in MySQL 5.1 – I now know it's because of how InnoDB works.

This question is really about what 'confuses' the optimiser enough to not be able to make these observations.

Best Answer

It depends. Let's dissect each scenario.

( SELECT UNION ALL SELECT ) JOIN

breaks down into

  1. create tmp table
  2. do first select into that tmp (n1 rows)
  3. do second select into that tmp (n2 rows)
  4. dedup because of ALL -- change to DISTINCT to skip this step (n3=n1+n2, or fewer rows)
  5. join n3 times

And...

( SELECT JOIN ) UNION ALL ( SELECT JOIN )

breaks down into

  1. create tmp table
  2. do first select and join into that tmp (n1 rows & n1 joins)
  3. do second select and join into that tmp (n2 rows & n2 joins)
  4. dedup because of ALL -- change to DISTINCT to skip this step (n3=n1+n2, or fewer rows)

So it depends on which case you have.

  • If the JOIN filters out some rows, then whatever follows it will have less work.
  • If you switch to UNION DISTINCT, that will shrink the subsequent work.
  • If the JOIN adds lots of bulky columns, the join-first case is building a bigger tmp table.

Watch out for <autokey0> -- it means you have a tmp table without an index, but 5.6 was smart enough to create one for you. Your SELECT, as written, should not need that unless ThingsDetails does not have an index on ThingsId. If that is the case, then add an index rather than taking advantage of autokey.