Mysql – Is it possible to optimize this query? Currently taking days (no sign of stopping) on ~11 million records

mariadbMySQLoptimization

Table in question:

CREATE TABLE `observations` (
  `time_of_observation` datetime NOT NULL,
  `station` varchar(5) NOT NULL,
  `number` varchar(20) NOT NULL,
  `destination` varchar(100) NOT NULL,
  `type` varchar(20) NOT NULL,
  `departure_time` datetime NOT NULL,
  `owner` varchar(20) NOT NULL,
  `altered` tinyint(1) NOT NULL DEFAULT '0',
  `delay` bigint(20) NOT NULL DEFAULT '0',
  `notes` text CHARACTER SET latin1,
  `route_text` text,
  `info` text CHARACTER SET latin1,
  `delay_text` text CHARACTER SET latin1,
  KEY `uindex2` (`time_of_observation`,`departure_time`,`number`,`delay`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

With index:

+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table        | Non_unique | Key_name | Seq_in_index | Column_name         | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| observations |          1 | uindex2  |            1 | time_of_observation | A         |       58588 |     NULL | NULL   |      | BTREE      |         |               |
| observations |          1 | uindex2  |            2 | departure_time      | A         |     5507365 |     NULL | NULL   |      | BTREE      |         |               |
| observations |          1 | uindex2  |            3 | number              | A         |    11014731 |     NULL | NULL   |      | BTREE      |         |               |
| observations |          1 | uindex2  |            4 | delay               | A         |    11014731 |     NULL | NULL   |      | BTREE      |         |               |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Query:

SELECT time_of_observation as last_seen_on, number, departure_time,
       departure_delay as last_known_delay
FROM observations as o1
WHERE NOT EXISTS (
  SELECT 1 
  FROM observations o2
  WHERE o1.number = o2.number
  AND o1.departure_time = o2.departure_time
  AND o1.time_of_observation < o2.time_of_observation
);`

number and departure_time uniquely identify an observed entity. Goal of this query is to find for each entity the last observation and the delay seen on this last observation.

This query is currently running for 140 hours (and has not finished) with the table containing ~11 million rows.

See EXPLAIN:

+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
| id   | select_type        | table | type  | possible_keys | key     | key_len | ref  | rows     | Extra                    |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
|    1 | PRIMARY            | o1    | index | NULL          | uindex2 | 86      | NULL | 11014731 | Using where; Using index |
|    2 | DEPENDENT SUBQUERY | o2    | index | uindex2       | uindex2 | 86      | NULL | 11014731 | Using where; Using index |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+

Am I doing something wrong here? Is it possible to optimize this query? Should it be this slow?

Best Answer

First of all you need to get rid of the dependent subquery.
For this you could rewrite your query like this:

SELECT o.time_of_observation as last_seen_on, o.number,
       o.departure_time, o.departure_delay as last_known_delay
FROM observations as o
JOIN (  SELECT
        number,
        departure_time
        MIN(time_of_observation) AS minobstime
        FROM observations
        GROUP BY 
        number,
        departure_time
) m ON m.number = o.number
   AND m.departure_time = o.departure_time
   AND m.minobstime = o.time_of_observation;

or this:

SELECT o1.time_of_observation as last_seen_on, o1.number,
       o1.departure_time, o1.departure_delay as last_known_delay
FROM observations as o1
LEFT JOIN observations as o2 ON o1.number = o1.number
  AND o1.departure_time = o2.departure_time
  AND o1.time_of_observation < o2.time_of_observation
WHERE o2.<any column, preferably your primary key> IS NULL;

Then it would be a good idea if you had a primary key. It can be an auto_increment column, too.

The columns in the index you have are in the wrong order. time_of_observation should be the last column in the index. First should be the columns where you make an equality comparison, then the ones where you scan a range.

Create an index like (number, departure_time, time_of_observation) and try one of the two queries from above.