Mysql – How to optimise query fetching most recent report for each entry in another table

MySQLperformancequery-performance

I am attempting to optimise a query that displays the last reported movement for a set railroad locomotives. The query currently takes 8-12s to run for a common query to generate a report on a webpage, which outputs around 5,000 rows.

The query involves the following steps:

  1. From the table t1 containing all known locomotives, select those belonging to a specific railroad (or railroads)
  2. From the table t2 containing movement reports, select the most recent report (highest value of entrynumber) referencing each entry selected from t1
  3. From those movement reports selected from t2, select only those in the last 30 days
  4. Join the selected reports in t2 to a matching entry in t3, if one exists; t2 will contain a NULL entry if not.

The end result is that for each entry in t1, the most recent report in t2 is output, unless that report is more than 30 days old.

An example query is:

SELECT t1.railroad, t1.number, t1.typeid, t2.date, t2.yard, t2.yardmaster, t2.trainsymbol, t3.trainuid 
FROM locodata AS t1 
LEFT JOIN locoreports AS t2 ON t2.locouid = t1.uid
LEFT JOIN tsarreports AS t3 ON t3.entrynumber = t2.tsarReportEntryNum
WHERE t1.railroad IN ('BNSF','BN','CBQ','GN','MILW')
AND t2.entrynumber IN (SELECT MAX(entrynumber) FROM locoreports GROUP BY locouid)
AND t2.date > DATE_SUB(NOW(), INTERVAL 30 DAY)
ORDER BY t1.number ASC

The tables involves are as follows:

  • t1/locodata contains a unique entry for each locomotive, giving some basic information. t1 contains around 53,000 rows. t1 has its primary key on (uid, railroad).
  • t2/locoreports contains historic location reports for each locomotive, with each location report linked to the corresponding entry in t1. Each report has an associated datetime. t2 contains around 5,700,000 rows. t2 has its primary key on entrynumber, which is AUTO_INCREMENT, plus additional keys on locouid and tsarReportEntryNum.
  • t3/tsarreports contains train data. Each entry in t2 may link to an entry in t3; multiple entries in t2 can link to the same t3 entry. t3 contains around 783,000 rows. t3 has its primary key on entrynumber.

Using EXPLAIN, I am very surprised to see that the search on t2 does not use any key at all, despite one being listed as available. It falls back on a full table scan, which I expect is the slow part!

+----+-------------+-------------+------------+--------+------------------+---------+---------+---------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table       | partitions | type   | possible_keys    | key     | key_len | ref                                   | rows    | filtered | Extra                                        |
+----+-------------+-------------+------------+--------+------------------+---------+---------+---------------------------------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | t2          | NULL       | ALL    | locouid          | NULL    | NULL    | NULL                                  | 5980278 |    33.33 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | t1          | NULL       | ref    | PRIMARY,railroad | PRIMARY | 8       | fymtraintracker.t2.locouid            |       1 |    32.85 | Using where                                  |
|  1 | PRIMARY     | t3          | NULL       | eq_ref | PRIMARY          | PRIMARY | 4       | fymtraintracker.t2.tsarReportEntryNum |       1 |   100.00 | NULL                                         |
|  2 | SUBQUERY    | locoreports | NULL       | range  | locouid          | locouid | 8       | NULL                                  |   38646 |   100.00 | Using index for group-by                     |
+----+-------------+-------------+------------+--------+------------------+---------+---------+---------------------------------------+---------+----------+----------------------------------------------+

So, some queries:

  1. Am I missing an obvious key for the tables?
  2. Is one of the WHERE statements particularly horrible? The subquery in the second WHERE statement doesn't seem particularly nice, although the EXPLAIN output implies it's not so bad.

Best Answer

So you can re-write this query to address a couple things. The first being because you're filtering on fields from t2, you're actually implicitly reducing your results as if you were using an INNER JOIN as opposed to a LEFT JOIN from t1, with the same predicate. There are some minute reasons why INNER JOINS are more performant than OUTER JOINS as well, so might as well use an INNER JOIN here like so:

SELECT t1.railroad, t1.number, t1.typeid, t2.date, t2.yard, t2.yardmaster, t2.trainsymbol, t3.trainuid 
FROM locodata AS t1 
INNER JOIN locoreports AS t2 ON t2.locouid = t1.uid
LEFT JOIN tsarreports AS t3 ON t3.entrynumber = t2.tsarReportEntryNum
WHERE t1.railroad IN ('BNSF','BN','CBQ','GN','MILW')
AND t2.entrynumber IN (SELECT MAX(entrynumber) FROM locoreports GROUP BY locouid)
AND t2.date > DATE_SUB(NOW(), INTERVAL 30 DAY)
ORDER BY t1.number ASC

Now that's definitely not going to fix all your performance problems, just something conceptually important to be aware of. Next thing I would suggest is try to avoid using subqueries in predicates (JOIN, WHERE, and HAVING clauses) as while not inherently always bad, generally can lead to suboptimal queries to be written as opposed to a more naturally relational equivalent. For example, in this case locoreports is by far your largest table (almost 6 million rows) and yet your query has to operate and reduce the data on it twice, when instead we can smartly get the rows with the latest entrynumber first and then join to only those rows like so:

SELECT t1.railroad, t1.number, t1.typeid, t2.date, t2.yard, t2.yardmaster, t2.trainsymbol, t3.trainuid 
FROM locodata AS t1 
INNER JOIN 
(
    SELECT date, yard, yardmaster, trainsymbol, locouid, tsarReportEntryNum, ROW_NUMBER() OVER (PARTITION BY locouid ORDER BY entrynumber DESC) AS SortId
    FROM locoreports AS t2 
    WHERE date > DATE_SUB(NOW(), INTERVAL 30 DAY)
) AS t2
ON t2.locouid = t1.uid
AND t2.SortId = 1
LEFT JOIN tsarreports AS t3 ON t3.entrynumber = t2.tsarReportEntryNum
WHERE t1.railroad IN ('BNSF','BN','CBQ','GN','MILW')
ORDER BY t1.number ASC

The above leverages the ROW_NUMBER() window function which is awesomely helpful for getting the max / min / top N rows of a grouping (PARTITION) however you want to order it, and generally is pretty efficient at doing so. ROW_NUMBER() will generate a unique ID for each row within the PARTITION (grouping) in the order you specify, this case on entrynumber descending. Then we can subsequently filter on only the first (latest) entrynumber in the JOIN predicates to the subquery with AND t2.SortId = 1. (Side note, you can change the 1 to any number and that's the N number of latest rows by t2.entrynumber descending you'd get back, but in your specific example you only want the very latest.)

This allowed us to remove the subquery on locoreports from your WHERE clause, and effectively we only need to process and reduce the data from locoreports once. I also moved up the filter on the date into the subquery as well. I wrote it in a subquery in the above example, though I normally like to refactor that stuff into a CTE just for readability, but I'm unsure of your MySQL version which may or may not support CTEs. Also, there may be other potential performance improvements you can try as well, but the above two things are what jumped out at me.