MySQL subqueries that use range based on values of main queries don’t use indices properly

MySQLoptimizationsubquery

I think I've isolated a problem that has been affecting many of my queries lately.
And would like some help to figure out a solution for this.

Ok so my findings are that a normal query that runs very fast using like a couple of rows can actually use indexes improperly when used in a subquery which is based on values from the main query.

Lets take an example:

DROP TEMPORARY TABLE IF EXISTS Person;
DROP TEMPORARY TABLE IF EXISTS CofeeBreaks;
CREATE TEMPORARY TABLE IF NOT EXISTS Person
(
    `person_id` INT(11) AUTO_INCREMENT,
    `age` INT,
    `lastCofee` DATETIME,
    KEY `idkey` (`person_id`) USING BTREE,
    KEY `datekey` (`lastCofee`) USING BTREE
) ENGINE = MEMORY;
CREATE TEMPORARY TABLE IF NOT EXISTS CofeeBreaks
(
    `id` INT(11) AUTO_INCREMENT,
    `cofeeBreakStart` DATETIME,
    `cofeeBreakEnd` DATETIME,
    KEY `brekIdKey`(`id`) USING BTREE
) ENGINE = MEMORY;
INSERT INTO Person (age, lastCofee) VALUES (24, '2013-03-27 14:45:34');
INSERT INTO Person (age, lastCofee) VALUES (34, '2013-03-27 14:46:38');
INSERT INTO Person (age, lastCofee) VALUES (26, '2013-03-27 15:25:24');
INSERT INTO Person (age, lastCofee) VALUES (28, '2013-03-27 16:33:54');
INSERT INTO Person (age, lastCofee) VALUES (46, '2013-03-27 17:11:03');
INSERT INTO CofeeBreaks (cofeeBreakStart, cofeeBreakEnd) VALUES ('2013-03-27 15:11:03', '2013-03-27 17:25:24');
INSERT INTO CofeeBreaks (cofeeBreakStart, cofeeBreakEnd) VALUES ('2013-03-27 14:45:34', '2013-03-27 15:25:24');
INSERT INTO CofeeBreaks (cofeeBreakStart, cofeeBreakEnd) VALUES ('2013-03-27 17:11:03', '2013-03-27 17:11:03');
SELECT * FROM Person WHERE lastCofee BETWEEN '2013-03-27 15:11:03' AND '2013-03-27 17:11:03';
SELECT
    *,
    (SELECT AVG(Person.age) FROM Person WHERE Person.lastCofee BETWEEN CofeeBreaks.cofeeBreakStart AND CofeeBreaks.cofeeBreakEnd) AS averageAge
FROM CofeeBreaks;

So the explain results for the first select are as follow:

1   SIMPLE  Person  range   datekey datekey 9       1   Using where

But the second query doesn't use the index properly in the subquery and analyses more rows than necessary:

id  select_type         table   type    possible_keys   key key_len ref rows
1   PRIMARY CofeeBreaks ALL                                               3
2   DEPENDENT SUBQUERY  Person  ALL       datekey                         5

As we can see the subquery needs to analyse all rows in the person table when none of the cofeebreaks ranges surrounds all of the 5 persons.

The way I've been fixing the performance issues in a very busy database is by doing a select into a temporary table and than looping the rows of that table and updating it with the aggregates that i need. This is obviously ugly code and shouldn't be built that way.
But I really haven't found a way to optimize queries for this kind of subqueries and I've tried a lot of possible ways to do this without success on the optimization.

Thanks in advance for any inputs.

Best Answer

This small data sample doesn't serve to illustrate that the behavior you are attempting to identify exists. Indeed, I've tested it on a larger data set and it did use the Index (using MySQL 5.5.30).

The problem is that when the optimizer determines that using an index would result in an inordinately large number of matches -- compared to the total number of rows in the table -- it won't use an index, because that could actually perform worse than simply scanning the whole table, and it will exhibit exactly the behavior this example illustrates... it knows the index is a candidate, but it chooses not to use it.

But I would suggest the problem lies in the fact that you're using a subquery in a place where a subquery isn't really necessary or called for.

I rewrote this as a join, because, from what I can tell, this is what you're asking the database to do: join each row in Person to the matching row(s) in CofeeBreaks where that person took their break during that window, and average the ages of the attendees.

I also built this here on SQL Fiddle. Removed TEMPORARY from the table definitions because the Fiddle doesn't seem to support them properly (because it probably uses a connection pool).

SELECT cb.id,
       cb.cofeeBreakStart,
       cb.cofeeBreakEnd,
       avg(p.age)
  FROM CofeeBreaks cb
  JOIN Person p ON p.lastCofee BETWEEN cb.cofeeBreakStart AND cb.cofeeBreakEnd
 GROUP BY cb.id;

EXPLAIN SELECT on this subquery shows that the index is being used, even on this small data set... although if you change that to a LEFT JOIN (which would show all coffee breaks even if nobody took that particular break, while the JOIN only includes breaks where at least one person did), the index shows up as being a candidate, but doesn't get used... again, likely because of the cost, and this behavior would likely be different with a larger data set.

The LEFT JOIN version of the query would produce identical results to your subquery regardless of the table data, while the JOIN version only produces identical results if every CofeeBreak had at least one person taking that break, which in your sample data, it does.

But using the indexes or not, a correlated subquery will not usually scale as well as a join.

http://dev.mysql.com/doc/refman/5.5/en/rewriting-subqueries.html