Mysql – Individual queries run on 10ms, with UNION ALL they are taking 290ms+ (7.7M records MySQL DB). How to optimise

MySQL

I have a table that stores available appointments for teachers, allowing two kinds of insertions:

Hourly based: with total freedom to add unlimited slots per day per teacher (as long as slots don't overlap): on 15/Apr a teacher may have slots at 10:00, 11:00, 12:00 and 16:00. A person is served after choosing a specific teacher time/slot.
Time period/range: on 15/Apr another teacher may work from 10:00 to 12:00 and then from 14:00 to 18:00. A person is served by order of arrival, so if a teacher works from 10:00 to 12:00, all persons that arrive in this period will be attended by order of arrival (local queue).

Since I have to return all available teachers in a search, I need all slots to be saved in the same table as the order of arrival ranges. This way I can order by date_from ASC, showing the first available slots first on the search results.

Current table structure

CREATE TABLE `teacher_slots` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `teacher_id` mediumint(8) unsigned NOT NULL,
  `city_id` smallint(5) unsigned NOT NULL,
  `subject_id` smallint(5) unsigned NOT NULL,
  `date_from` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `date_to` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `status` tinyint(4) NOT NULL DEFAULT '0',
  `order_of_arrival` tinyint(1) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `by_hour_idx` (`teacher_id`,`order_of_arrival`,`status`,`city_id`,`subject_id`,`date_from`),
  KEY `order_arrival_idx` (`order_of_arrival`,`status`,`city_id`,`subject_id`,`date_from`,`date_to`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Search query

I need to filter by: actual datetime, city_id, subject_id and if a slot is available (status = 0).

For hourly based I have to show all available slots for the first closest available day for every teacher (show all time slots of a given day and can't show more than one day for the same teacher). (I got the query with the help from mattedgod).

For range based (order_of_arrival = 1), I have to show the closest available range, just one time per teacher.

First query runs individually in around 0.10 ms, second query 0.08 ms and the UNION ALL an average of 300ms.

(
    SELECT id, teacher_slots.teacher_id, date_from, date_to, order_of_arrival
    FROM teacher_slots
    JOIN (
        SELECT DATE(MIN(date_from)) as closestDay, teacher_id
        FROM teacher_slots
        WHERE   date_from >= '2014-04-10 08:00:00' AND order_of_arrival = 0
                AND status = 0 AND city_id = 6015 AND subject_id = 1
        GROUP BY teacher_id
    ) a ON a.teacher_id = teacher_slots.teacher_id
    AND DATE(teacher_slots.date_from) = closestDay
    WHERE teacher_slots.date_from >= '2014-04-10 08:00:00'
        AND teacher_slots.order_of_arrival = 0
        AND teacher_slots.status = 0
        AND teacher_slots.city_id = 6015
        AND teacher_slots.subject_id = 1
)

UNION ALL

(
    SELECT id, teacher_id, date_from, date_to, order_of_arrival
    FROM teacher_slots
    WHERE order_of_arrival = 1 AND status = 0 AND city_id = 6015 AND subject_id = 1
        AND (
            (date_from <= '2014-04-10 08:00:00' AND  date_to >= '2014-04-10 08:00:00')
            OR (date_from >= '2014-04-10 08:00:00')
        )
    GROUP BY teacher_id
)

ORDER BY date_from ASC;

Question

Is there a way to optimise the UNION, so I can get a reasonable response of a maximum ~20ms or even return range based + hourly based in just one query (with an IF, etc)?

SQL Fiddle: http://www.sqlfiddle.com/#!2/59420/1/0

EDIT:

I tried some denormalization by creating a field "only_date_from" where I stored only the date, so I could change this…

DATE(MIN(date_from)) as closestDay / DATE(teacher_slots.date_from) = closestDay

… to this

MIN(only_date_from) as closestDay / teacher_slots.only_date_from = closestDay

It already saved me 100ms! Still 200ms on average.

Best Answer

Firstly, I think your original query may not be "correct"; With reference to your SQLFiddle, it looks to me as though you should be returning rows with ID = 2, 3 and 4 (in addition to the row with ID = 1 you are getting from this half), because your existing logic appears as though you intended for these other rows to be included, as they explicitly meet the OR (date_from >= '2014-04-10 08:00:00') part of your second WHERE clause.

The GROUP BY teacher_id clause in your second part of your UNION is causing you to lose those rows. This is because you're not actually aggregating any columns in your select list, and in this case the GROUP BY will cause 'difficult to define' behaviour.

Also, while I can't explain the poor performance of your UNION, I can work around it for you by outright removing it from your query:

Rather than using two separate (and in parts, repeating) sets of logic to get rows from the same table, I've consolidated your logic into one query with the differences in your logic ORed together - ie if a row meets one or the other of your original WHERE clauses, it's included. This is possible because I've replaced the (INNER) JOIN you were using to find the closestDate with a LEFT JOIN.

This LEFT JOIN means we are now also able to distinguish which set of logic should be applied to a row; If the join works (closestDate IS NOT NULL) we apply your logic from the first half, but if the join fails (closestDate IS NULL) then we apply the logic from your second half.

So this will return all the rows that your query returned (in the fiddle), and it's also picking up those additional ones.

  SELECT
    *

  FROM 
    teacher_slots ts

    LEFT JOIN 
    (
      SELECT 
        teacher_id,
        DATE(MIN(date_from)) as closestDay

      FROM 
        teacher_slots

      WHERE   
        date_from >= '2014-04-10 08:00:00' 
        AND order_of_arrival = 0
        AND status = 0 
        AND city_id = 6015 
        AND subject_id = 1

      GROUP BY 
        teacher_id

    ) a
    ON a.teacher_id = ts.teacher_id
    AND a.closestDay = DATE(ts.date_from)

  WHERE 
    /* conditions that were common to both halves of the union */
    ts.status = 0
    AND ts.city_id = 6015
    AND ts.subject_id = 1

    AND
    (
      (
        /* conditions that were from above the union 
           (ie when we joined to get closest future date) */
        a.teacher_id IS NOT NULL
        AND ts.date_from >= '2014-04-10 08:00:00'
        AND ts.order_of_arrival = 0
      ) 
      OR
      (
        /* conditions that were below the union 
          (ie when we didn't join) */
        a.teacher_id IS NULL       
        AND ts.order_of_arrival = 1 
        AND 
        (
          (
            date_from <= '2014-04-10 08:00:00' 
            AND  
            date_to >= '2014-04-10 08:00:00'
          )

          /* rows that met this condition were being discarded 
             as a result of 'difficult to define' GROUP BY behaviour. */
          OR date_from >= '2014-04-10 08:00:00' 
        )
      )
    )

  ORDER BY 
   ts.date_from ASC;

Further, you can "tidy up" your query further so that you don't need to "plug in" your status, city_id and subject_id parameters more than once.

To do this, change the subquery a to also select those columns, and to also group on those columns. Then, the JOIN's ON clause would need to map those columns to their ts.xxx equivalents.

I don't think this will negatively effect performance, but couldn't be sure without testing on a large dataset.

So your join will look more like:

LEFT JOIN 
(
  SELECT 
    teacher_id,
    status,
    city_id,
    subject_id,
    DATE(MIN(date_from)) as closestDay

  FROM 
    teacher_slots

  WHERE   
    date_from >= '2014-04-10 08:00:00' 
    AND order_of_arrival = 0
  /* These no longer required here...
    AND status = 0 
    AND city_id = 6015 
    AND subject_id = 1
  */

  GROUP BY 
    teacher_id,
    status,
    city_id,
    subject_id

) a
ON a.teacher_id = ts.teacher_id
AND a.status = ts.status 
AND a.city_id = ts.city_id 
AND a.subject_id = ts.city_id
AND a.closestDay = DATE(ts.date_from)

UPDATE 2012-02-21 17:06 EDT

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM
TAB_EVENT e LEFT JOIN
(
    SELECT * FROM
    (SELECT id,event_id,datetime FROM TAB_MESSAGE
    WHERE read IS NULL OR read = 0
    ORDER BY datetime DESC) mm
    LIMIT 10
) m
ON e.id=m.event_id
ORDER BY m.datetime DESC, e.id ASC;

@Sebastian, I put the query back in the original join order. Please try this as well !!!

UPDATE 2012-02-21 17:11 EDT

Make sure the datetime field is indexed

ALTER TABLE TAB_MESSAGE ADD INDEX read_datetime_ndx (read,datetime);

MySQL: Optimize UNION with “ORDER BY” in inner queries

Just out of curiosity, can you try this version? It may trick the optimizer to use the same indices that the subqueries would use separately:

SELECT *
FROM
(SELECT l.id, l.account_id, l.vnum, l.count, l.preis, l.zeitpunkt AS zeit,
 'hp' AS source FROM is_log AS l WHERE l.account_id = 730
 ORDER BY l.zeitpunkt DESC LIMIT 10) 
    AS a

UNION ALL

SELECT *
FROM
(SELECT l.id, l.account_id, l.vnum, l.count, l.preis, l.zeitpunkt,
 'ig' AS source FROM ig_is_log AS l WHERE l.account_id = 730
 ORDER BY l.zeitpunkt DESC LIMIT 10)
    AS b

ORDER BY zeit DESC LIMIT 10;

I still think that the best index you could have is the compound (account_id, zeitpunkt). It would yield the 10 rows fast, and no tricks would ne needed.