Mysql – LEFT JOIN performance for WHERE on secondary table

I've been developing for years, and throughout my career I've used only LEFT JOIN, as it gives me a more logical way of mapping my one-to-many data.

However, a performance question recently occurred to me:

Say I have a table user with millions of rows and a table gender with only 2 rows:

user                                gender
=====================               ================
id name      genderId               id gender show
=====================               ================
1  Sam       1                      1  Male   0
2  Samantha  2                      2  Female 1
...  (millions more records)

If I'm doing

select * from user u left join gender g on u.genderId = g.id where g.show = 1;

Does the SQL engine join all the millions of user records first, and then scan those joined rows for the projected value?

Or is the engine smart enough to query and join only the rows with the correct gender? Or is RIGHT JOIN the correct way to get the best performance?

If the behavior is dependent on engine type, let's talk about MySQL InnoDB.

Best Answer

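Remember how SQL works conceptually: it constructs a set of tuples (rows) from the given tables and joins, then filters them with any where clause.
Because you have a where condition on the "right" table, your outer join is effectively converted into an inner join: a user row with no matching gender row gets NULL for g.show, NULL = 1 is not true, and the where clause discards that row. MySQL's optimizer performs this conversion itself when the where condition rejects NULLs, after which it is free to choose the join order as for any inner join.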

To retain the "outer-ness" of the join, put the filter condition into the join construct, rather than in the where clause:

select * 
from      user u 
left join gender g 
     on  u.genderId = g.id 
     and g.show = 1 
[where ...]
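
To see the difference between the two placements of the filter, here is a minimal sketch that runs both queries against the two sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so the example is self-contained; it illustrates the join semantics, not MySQL's execution plan or performance.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption: semantics only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gender (id INTEGER PRIMARY KEY, gender TEXT, "show" INTEGER);
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT, genderId INTEGER);
    INSERT INTO gender VALUES (1, 'Male', 0), (2, 'Female', 1);
    INSERT INTO "user" VALUES (1, 'Sam', 1), (2, 'Samantha', 2);
""")

# Filter in WHERE: rows whose gender has show = 0 are dropped entirely,
# so the LEFT JOIN behaves like an INNER JOIN.
where_rows = conn.execute("""
    SELECT u.name, g.gender
    FROM "user" u
    LEFT JOIN gender g ON u.genderId = g.id
    WHERE g."show" = 1
    ORDER BY u.id
""").fetchall()

# Filter in ON: every user is kept; gender columns are NULL where show <> 1.
on_rows = conn.execute("""
    SELECT u.name, g.gender
    FROM "user" u
    LEFT JOIN gender g ON u.genderId = g.id AND g."show" = 1
    ORDER BY u.id
""").fetchall()

print(where_rows)  # [('Samantha', 'Female')]
print(on_rows)     # [('Sam', None), ('Samantha', 'Female')]
```

If you only want users whose gender is shown, the where version (or a plain inner join) is what you mean; if you want every user with the gender filled in only when it is shown, use the on version.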

Related Solutions

Mysql – Slow performance of MySQL Join Query

Please provide SHOW CREATE TABLE; the EXPLAIN output is useless without it.

OR is a performance killer in many contexts.

( p.pricelist = "name_abc" AND p.iln = "sellerID_123" ) OR ( p.pricelist = "name_def" AND p.iln = "sellerID_456" ) OR ...

Turn that into

JOIN ( SELECT id FROM p WHERE 
( p.pricelist = "name_abc" AND p.iln = "sellerID_123" ) OR 
( p.pricelist = "name_def" AND p.iln = "sellerID_456" ) OR ... ) x ON x.id = foo.id

Also needed (on p):

INDEX(pricelist, iln, id)

(With the CREATEs, I could be more specific.)

The idea behind this "trick" is to move the costly work of the OR into a subquery that returns only the necessary ids. On top of that, the INDEX contains every column the subquery touches, so all of that work can be done in the INDEX without reading the table rows.
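
As a rough sketch of the rewrite in full, the snippet below runs the OR both ways and checks that they return the same rows. The tables p and foo, the label column, and the sample data are hypothetical (the original schema was never posted), and SQLite via Python's sqlite3 stands in for MySQL, so this only shows that the two forms are equivalent in result, not how much faster the second one is on MySQL.

```python
import sqlite3

# Hypothetical schema and data; SQLite is used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE p   (id INTEGER PRIMARY KEY, pricelist TEXT, iln TEXT);
    CREATE TABLE foo (id INTEGER PRIMARY KEY, label TEXT);
    CREATE INDEX p_pricelist_iln_id ON p (pricelist, iln, id);
    INSERT INTO p VALUES
        (1, 'name_abc', 'sellerID_123'),
        (2, 'name_def', 'sellerID_456'),
        (3, 'name_abc', 'sellerID_456'),
        (4, 'name_xyz', 'sellerID_123');
    INSERT INTO foo VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
""")

# Original shape: the OR sits in the WHERE clause of the joined query.
or_in_where = conn.execute("""
    SELECT foo.id, foo.label
    FROM foo
    JOIN p ON p.id = foo.id
    WHERE (p.pricelist = 'name_abc' AND p.iln = 'sellerID_123')
       OR (p.pricelist = 'name_def' AND p.iln = 'sellerID_456')
    ORDER BY foo.id
""").fetchall()

# Rewritten shape: the OR is confined to a derived table that returns only ids.
or_in_derived = conn.execute("""
    SELECT foo.id, foo.label
    FROM foo
    JOIN ( SELECT id FROM p
           WHERE (p.pricelist = 'name_abc' AND p.iln = 'sellerID_123')
              OR (p.pricelist = 'name_def' AND p.iln = 'sellerID_456') ) x
      ON x.id = foo.id
    ORDER BY foo.id
""").fetchall()

print(or_in_where)    # [(1, 'a'), (2, 'b')]
print(or_in_derived)  # [(1, 'a'), (2, 'b')]
```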

Mysql – slow query on MySQL InnoDB table with 2 million rows

Your query formatted a bit:

SELECT DATE_FORMAT(dateCreated, '%Y-%m-%d') AS date,
       eventPriority,
       count(*) AS total
FROM `events`
WHERE customerName="Customer A"
  AND processed=1
  AND ticketStatus="CREATE"
  AND dateCreated >= "2015-07-01"
  AND dateCreated <= "2015-10-01"
GROUP BY date, eventPriority
ORDER BY `date` ASC;

The explain shows that the trends index is used and it is defined as

KEY `trends` (
    `customerName`(50),
    `processed`,
    `ticketStatus`,
    `dateCreated`,
    `eventName`(128),
    `eventPriority`
);

As we can see, the index contains all the columns of your query, arranged in such a way that we can quickly get all events of "Customer A" which are processed and have status "CREATE", and then just read the right range for dateCreated. That seems really good, but as we can see from the EXPLAIN, there is one additional step after that: for every index entry that satisfies the conditions, MySQL takes the ID (the PRIMARY KEY is part of every secondary index in InnoDB) and locates the row in the "main" part of the table. The values of those IDs are quite randomly distributed (even if we expect that the autoincrement value roughly follows dateCreated, it is still probably not a contiguous set, because events of many customers fall inside that date range), so the query is doing ~140k random lookups by primary key to locate all the rows and build the temp table used for grouping.

The index contains all the columns used in your query (even eventPriority), so it looks like it could be covering, which would make this last step unnecessary. But because of the `customerName`(50) prefix, MySQL cannot be sure that the index holds the complete customerName value: if a row had text longer than 50 characters, the index would contain only part of it. We can see that the full value is not needed for "Customer A", as it is quite short, but this optimization is not yet implemented, so MySQL has to fetch the full row.

When you remove that length limit from the index, MySQL can see that it has all the data needed to resolve your query just by reading the index entries (as even eventPriority is included; eventName is too, but it could be left out). So you save tens of thousands of "random" lookups, because the index contains what you need and is sorted in such a way that it can be read quickly.
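
To illustrate what "resolved just by reading the index" means, here is a small sketch using Python's built-in sqlite3 module. This is a substitute for MySQL, not a reproduction of it: SQLite has no prefix indexes, so the snippet only shows the target state (a full-column index that covers the query). The sample rows, the alias day, and the use of date() in place of DATE_FORMAT are assumptions made for the example.

```python
import sqlite3

# SQLite stand-in for MySQL; sample data is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        customerName TEXT, processed INTEGER, ticketStatus TEXT,
        dateCreated TEXT, eventName TEXT, eventPriority INTEGER
    );
    -- Same column order as the trends key, but without length prefixes.
    CREATE INDEX trends ON events
        (customerName, processed, ticketStatus, dateCreated, eventName, eventPriority);
    INSERT INTO events VALUES
        (1, 'Customer A', 1, 'CREATE', '2015-07-02 10:00:00', 'disk', 1),
        (2, 'Customer A', 1, 'CREATE', '2015-07-02 11:00:00', 'cpu',  1),
        (3, 'Customer A', 1, 'CREATE', '2015-08-15 09:30:00', 'disk', 2),
        (4, 'Customer B', 1, 'CREATE', '2015-07-02 12:00:00', 'disk', 1),
        (5, 'Customer A', 0, 'CREATE', '2015-07-03 12:00:00', 'disk', 1);
""")

query = """
    SELECT date(dateCreated) AS day, eventPriority, count(*) AS total
    FROM events
    WHERE customerName = 'Customer A'
      AND processed = 1
      AND ticketStatus = 'CREATE'
      AND dateCreated >= '2015-07-01'
      AND dateCreated <= '2015-10-01'
    GROUP BY day, eventPriority
    ORDER BY day ASC
"""

rows = conn.execute(query).fetchall()
# The last field of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(rows)
for line in plan:
    print(line)
```

In SQLite the plan should mention a COVERING INDEX for trends; the MySQL counterpart to look for is "Using index" in the Extra column of EXPLAIN once the (50) prefix is gone.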

===

There might be one more possibility for making it even faster. Currently, GROUP BY date means grouping by the result of a function that has to be computed for every matching row, so an index cannot be used for it. You could add another column to the table to store the result of DATE_FORMAT(dateCreated, '%Y-%m-%d'); it could be maintained by a trigger, by your application, or by a newer MySQL feature, generated columns. Then you would just add that column to your index (probably the best place would be just before dateCreated).

But this is probably not needed unless some customer generates a very large number of events for some date range, or you need to group over very long ranges.
