Mysql – Understanding optimisation of complex SELECT queries

MySQLoptimization

I have a people table with basic contact info in. I have a bunch of criteria that can select Id values (fields called pid) from the people table.

  1. I want records that match ((criterion 1 or criterion 2) and criterion 3).

  2. I want the criteria statements to add columns to the results.

  3. I need a standard way to assemble the queries because I'm making a 'search builder'; so I can't optimise every combination. It can be assumed that each criterion will return zero or one row.

EG1: I first tried

SELECT people.*
  FROM people
       LEFT JOIN (criterion1 SELECT) c1 ON people.id=c1.pid
       LEFT JOIN (criterion2 SELECT) c2 ON people.id=c2.pid,
       (criterion3 SELECT) c3
 WHERE people.id=c3.pid
   AND ( c1.pid IS NOT NULL OR c2.pid IS NOT NULL );

This was far too slow – it never actually returned records in the minute I allowed it!

EG2:Next I tried

SELECT people.*
  FROM people,
       (       (criterion1 SELECT) 
         UNION (criterion2 SELECT) ) c12
 WHERE people.id=c12.pid
   AND people.id IN (criterion3 SELECT)

This one returns 2000+ rows in 0.6s, which is acceptable. But you can't add any columns from criterion 3, and can't access columns from c1 and c2 either, really, since one will mask another.

EG3:Out of interest, I moved the c12 into the WHERE clause:

SELECT people.*
  FROM people
 WHERE id IN (criterion3 SELECT)
   AND id IN ( (criterion1 SELECT) 
               UNION (criterion2 SELECT) )

but again, this threw it for well over a minute before I killed the query.

EG4: So I returned to the EG2 code, and did this ugly thing:

SELECT * 
  FROM ( all the EG2 SQL ) src
       LEFT JOIN (criterion1 SELECT) c1 ON src.id=c1.pid
       LEFT JOIN (criterion2 SELECT) c2 ON src.id=c2.pid

So the c1 and c2 queries are added in twice! This returns the (same number of) results, with the extra fields, in 6s. Too slow, really, but at least it runs. And really ugly!

Can anyone give me any pointers to explain any of this? Why is EG3 massively faster than EG1/EG4 when they're all doing the same thing? Why should running the queries twice (EG4) be faster than running them once (EG1)?!

EDIT: Using MySQL 5.1, InnoDB tables.

Best Answer

  • Why there are massive differences in performance between the queries?
    (assuming that you have at least defined indexes for the columns used in the subquery criteria and in the joining conditions)

Basically because MySQL optimizer is not smart enough to figure out that all these queries are equivalent. So, it probably produces different execution plans for the different queries. If you do not have properly defined FOREIGN KEY constraints, the optimizer may be actually right, there is no guarantee that the queries return identical results.

  • So, what to do to increase performance and have a standard way of writing this type of query variants?

Several things that affect MySQL queries performance:

  1. Do not use id IN (SELECT subquery) if you can avoid it. It's not very well optimized in most MySQL versions (see point 6 below). Use joins if you can.

  2. Replace UNION with UNION ALL if that doesn't change the result set (you could do that in EG3 query).

  3. Don't use implicit joins (with commas in the FROM clause and the joining condition in the WHERE clause). Use explicit JOIN syntax. As an example, your EG1 is actually the same as (this is not for performance but for consistence):

    SELECT people.*
      FROM people
           LEFT JOIN (criterion1 SELECT) c1 ON people.id=c1.pid
           LEFT JOIN (criterion2 SELECT) c2 ON people.id=c2.pid,
           JOIN (criterion3 SELECT) c3 ON people.id=c3.pid
     WHERE ( c1.pid IS NOT NULL OR c2.pid IS NOT NULL );
    
  4. The above query has an OR condition that is relevant to 2 tables (and their joins to the people, so 3 tables actually). This is usually not very good perfomance-wise.

  5. You can try rewriting using EXISTS. This will make your queries easier to write and it may help with performance, too:

    SELECT people.*
      FROM people
     WHERE EXISTS 
             (criterion3 SELECT modified with `people.id = some_table.pid`)
       AND ( EXISTS 
               (criterion1 SELECT modified with `people.id = some_table.pid`)
          OR EXISTS 
               (criterion2 SELECT modified with `people.id = some_table.pid`) 
           ) ;
    
  6. If your host allows it, try/test MariaDB (it's a MySQL fork-replacement) that has introduced several improvements in queries execution in its latest versions. The optimizer will be a bit smarter identifying equivalent queries and certainly smarter as it has some new algorithms implemented that affect queries will joins, subqueries among other things.

MySQL 5.6 has also a few improvements in the optimizer but it's not available yet as a stable release.