Table in question:
CREATE TABLE `observations` (
`time_of_observation` datetime NOT NULL,
`station` varchar(5) NOT NULL,
`number` varchar(20) NOT NULL,
`destination` varchar(100) NOT NULL,
`type` varchar(20) NOT NULL,
`departure_time` datetime NOT NULL,
`owner` varchar(20) NOT NULL,
`altered` tinyint(1) NOT NULL DEFAULT '0',
`delay` bigint(20) NOT NULL DEFAULT '0',
`notes` text CHARACTER SET latin1,
`route_text` text,
`info` text CHARACTER SET latin1,
`delay_text` text CHARACTER SET latin1,
KEY `uindex2` (`time_of_observation`,`departure_time`,`number`,`delay`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
With index:
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| observations | 1 | uindex2 | 1 | time_of_observation | A | 58588 | NULL | NULL | | BTREE | | |
| observations | 1 | uindex2 | 2 | departure_time | A | 5507365 | NULL | NULL | | BTREE | | |
| observations | 1 | uindex2 | 3 | number | A | 11014731 | NULL | NULL | | BTREE | | |
| observations | 1 | uindex2 | 4 | delay | A | 11014731 | NULL | NULL | | BTREE | | |
+--------------+------------+----------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Query:
SELECT time_of_observation as last_seen_on, number, departure_time,
departure_delay as last_known_delay
FROM observations as o1
WHERE NOT EXISTS (
SELECT 1
FROM observations o2
WHERE o1.number = o2.number
AND o1.departure_time = o2.departure_time
AND o1.time_of_observation < o2.time_of_observation
);`
number
and departure_time
uniquely identify an observed entity. Goal of this query is to find for each entity the last observation and the delay seen on this last observation.
This query is currently running for 140 hours (and has not finished) with the table containing ~11 million rows.
See EXPLAIN:
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
| 1 | PRIMARY | o1 | index | NULL | uindex2 | 86 | NULL | 11014731 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | o2 | index | uindex2 | uindex2 | 86 | NULL | 11014731 | Using where; Using index |
+------+--------------------+-------+-------+---------------+---------+---------+------+----------+--------------------------+
Am I doing something wrong here? Is it possible to optimize this query? Should it be this slow?
Best Answer
First of all you need to get rid of the dependent subquery.
For this you could rewrite your query like this:
or this:
Then it would be a good idea if you had a primary key. It can be an auto_increment column, too.
The columns in the index you have are in the wrong order.
time_of_observation
should be the last column in the index. First should be the columns where you make an equality comparison, then the ones where you scan a range.Create an index like
(number, departure_time, time_of_observation)
and try one of the two queries from above.