I have a people table with basic contact info in. I have a bunch of criteria that can select Id values (fields called pid) from the people table.
-
I want records that match ((criterion 1 or criterion 2) and criterion 3).
-
I want the criteria statements to add columns to the results.
-
I need a standard way to assemble the queries because I'm making a 'search builder'; so I can't optimise every combination. It can be assumed that each criterion will return zero or one row.
EG1: I first tried
SELECT people.*
FROM people
LEFT JOIN (criterion1 SELECT) c1 ON people.id=c1.pid
LEFT JOIN (criterion2 SELECT) c2 ON people.id=c2.pid,
(criterion3 SELECT) c3
WHERE people.id=c3.pid
AND ( c1.pid IS NOT NULL OR c2.pid IS NOT NULL );
This was far too slow – it never actually returned records in the minute I allowed it!
EG2:Next I tried
SELECT people.*
FROM people,
( (criterion1 SELECT)
UNION (criterion2 SELECT) ) c12
WHERE people.id=c12.pid
AND people.id IN (criterion3 SELECT)
This one returns 2000+ rows in 0.6s, which is acceptable. But you can't add any columns from criterion 3, and can't access columns from c1 and c2 either, really, since one will mask another.
EG3:Out of interest, I moved the c12 into the WHERE clause:
SELECT people.*
FROM people
WHERE id IN (criterion3 SELECT)
AND id IN ( (criterion1 SELECT)
UNION (criterion2 SELECT) )
but again, this threw it for well over a minute before I killed the query.
EG4: So I returned to the EG2 code, and did this ugly thing:
SELECT *
FROM ( all the EG2 SQL ) src
LEFT JOIN (criterion1 SELECT) c1 ON src.id=c1.pid
LEFT JOIN (criterion2 SELECT) c2 ON src.id=c2.pid
So the c1 and c2 queries are added in twice! This returns the (same number of) results, with the extra fields, in 6s. Too slow, really, but at least it runs. And really ugly!
Can anyone give me any pointers to explain any of this? Why is EG3 massively faster than EG1/EG4 when they're all doing the same thing? Why should running the queries twice (EG4) be faster than running them once (EG1)?!
EDIT: Using MySQL 5.1, InnoDB tables.
Best Answer
(assuming that you have at least defined indexes for the columns used in the subquery criteria and in the joining conditions)
Basically because MySQL optimizer is not smart enough to figure out that all these queries are equivalent. So, it probably produces different execution plans for the different queries. If you do not have properly defined
FOREIGN KEY
constraints, the optimizer may be actually right, there is no guarantee that the queries return identical results.Several things that affect MySQL queries performance:
Do not use
id IN (SELECT subquery)
if you can avoid it. It's not very well optimized in most MySQL versions (see point 6 below). Use joins if you can.Replace
UNION
withUNION ALL
if that doesn't change the result set (you could do that in EG3 query).Don't use implicit joins (with commas in the
FROM
clause and the joining condition in theWHERE
clause). Use explicitJOIN
syntax. As an example, your EG1 is actually the same as (this is not for performance but for consistence):The above query has an
OR
condition that is relevant to 2 tables (and their joins to thepeople
, so 3 tables actually). This is usually not very good perfomance-wise.You can try rewriting using
EXISTS
. This will make your queries easier to write and it may help with performance, too:If your host allows it, try/test MariaDB (it's a MySQL fork-replacement) that has introduced several improvements in queries execution in its latest versions. The optimizer will be a bit smarter identifying equivalent queries and certainly smarter as it has some new algorithms implemented that affect queries will joins, subqueries among other things.
MySQL 5.6 has also a few improvements in the optimizer but it's not available yet as a stable release.