I am attempting to optimise a query that displays the last reported movement for a set of railroad locomotives. The query, which generates a report on a webpage, currently takes 8-12 seconds to run for a common case and outputs around 5,000 rows.
The query involves the following steps:

- From the table `t1` containing all known locomotives, select those belonging to a specific railroad (or railroads).
- From the table `t2` containing movement reports, select the most recent report (highest value of `entrynumber`) referencing each entry selected from `t1`.
- From those movement reports selected from `t2`, select only those in the last 30 days.
- Join the selected reports in `t2` to a matching entry in `t3`, if one exists; `t2` will contain a NULL entry if not.

The end result is that for each entry in `t1`, the most recent report in `t2` is output, unless that report is more than 30 days old.
An example query is:

```sql
SELECT t1.railroad, t1.number, t1.typeid, t2.date, t2.yard, t2.yardmaster, t2.trainsymbol, t3.trainuid
FROM locodata AS t1
LEFT JOIN locoreports AS t2 ON t2.locouid = t1.uid
LEFT JOIN tsarreports AS t3 ON t3.entrynumber = t2.tsarReportEntryNum
WHERE t1.railroad IN ('BNSF','BN','CBQ','GN','MILW')
AND t2.entrynumber IN (SELECT MAX(entrynumber) FROM locoreports GROUP BY locouid)
AND t2.date > DATE_SUB(NOW(), INTERVAL 30 DAY)
ORDER BY t1.number ASC
```
The tables involved are as follows:

- `t1`/locodata contains a unique entry for each locomotive, giving some basic information. `t1` contains around 53,000 rows and has its primary key on (`uid`, `railroad`).
- `t2`/locoreports contains historic location reports for each locomotive, with each location report linked to the corresponding entry in `t1`. Each report has an associated datetime. `t2` contains around 5,700,000 rows and has its primary key on `entrynumber`, which is `AUTO_INCREMENT`, plus additional keys on `locouid` and `tsarReportEntryNum`.
- `t3`/tsarreports contains train data. Each entry in `t2` may link to an entry in `t3`; multiple entries in `t2` can link to the same `t3` entry. `t3` contains around 783,000 rows and has its primary key on `entrynumber`.
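For reference, the schema described above might look roughly like the sketch below. Only the columns and keys mentioned in the question (or visible in the example query and `EXPLAIN` output) are shown, and all column types are assumptions:

```sql
CREATE TABLE locodata (          -- t1, ~53,000 rows
    uid        INT NOT NULL,
    railroad   VARCHAR(8) NOT NULL,
    number     INT,
    typeid     INT,
    PRIMARY KEY (uid, railroad),
    KEY railroad (railroad)      -- implied by possible_keys in the EXPLAIN output
);

CREATE TABLE locoreports (       -- t2, ~5,700,000 rows
    entrynumber        INT NOT NULL AUTO_INCREMENT,
    locouid            INT NOT NULL,
    date               DATETIME NOT NULL,
    yard               VARCHAR(32),
    yardmaster         VARCHAR(32),
    trainsymbol        VARCHAR(32),
    tsarReportEntryNum INT,
    PRIMARY KEY (entrynumber),
    KEY locouid (locouid),
    KEY tsarReportEntryNum (tsarReportEntryNum)
);

CREATE TABLE tsarreports (       -- t3, ~783,000 rows
    entrynumber INT NOT NULL,
    trainuid    INT,
    PRIMARY KEY (entrynumber)
);
```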
Using `EXPLAIN`, I am very surprised to see that the search on `t2` does not use any key at all, despite one being listed as available. It falls back on a full table scan, which I expect is the slow part!
```
+----+-------------+-------------+------------+--------+------------------+---------+---------+---------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table       | partitions | type   | possible_keys    | key     | key_len | ref                                   | rows    | filtered | Extra                                        |
+----+-------------+-------------+------------+--------+------------------+---------+---------+---------------------------------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | t2          | NULL       | ALL    | locouid          | NULL    | NULL    | NULL                                  | 5980278 |    33.33 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | t1          | NULL       | ref    | PRIMARY,railroad | PRIMARY | 8       | fymtraintracker.t2.locouid            |       1 |    32.85 | Using where                                  |
|  1 | PRIMARY     | t3          | NULL       | eq_ref | PRIMARY          | PRIMARY | 4       | fymtraintracker.t2.tsarReportEntryNum |       1 |   100.00 | NULL                                         |
|  2 | SUBQUERY    | locoreports | NULL       | range  | locouid          | locouid | 8       | NULL                                  |   38646 |   100.00 | Using index for group-by                     |
+----+-------------+-------------+------------+--------+------------------+---------+---------+---------------------------------------+---------+----------+----------------------------------------------+
```
So, some questions:

- Am I missing an obvious key for the tables?
- Is one of the `WHERE` conditions particularly horrible? The subquery in the second `WHERE` condition doesn't seem particularly nice, although the `EXPLAIN` output implies it's not so bad.
Best Answer
You can re-write this query to address a couple of things. First, because you're filtering on fields from `t2`, you're implicitly reducing your results as if you were using an `INNER JOIN` as opposed to a `LEFT JOIN` from `t1` with the same predicate. There are also some minor reasons why `INNER JOIN`s are more performant than `OUTER JOIN`s, so you might as well use an `INNER JOIN` here.

Now, that's definitely not going to fix all your performance problems; it's just something conceptually important to be aware of. The next thing I would suggest is to avoid using subqueries in predicates (`JOIN`, `WHERE`, and `HAVING` clauses). While not inherently always bad, they generally lead to suboptimal queries compared with a more naturally relational equivalent. For example, in this case `locoreports` is by far your largest table (almost 6 million rows), and yet your query has to operate on and reduce its data twice, when instead we can smartly get the rows with the latest `entrynumber` first and then join to only those rows.
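A sketch of that rewrite follows. It assumes MySQL 8.0 or later for window-function support; the column list is taken from the question's query, the date filter is moved into the subquery as described below, and `SortId` is just an illustrative alias:

```sql
SELECT t1.railroad, t1.number, t1.typeid, t2.date, t2.yard, t2.yardmaster, t2.trainsymbol, t3.trainuid
FROM locodata AS t1
INNER JOIN
(
    -- Rank each locomotive's reports so the latest entrynumber gets SortId = 1,
    -- keeping only reports from the last 30 days
    SELECT locouid, entrynumber, date, yard, yardmaster, trainsymbol, tsarReportEntryNum,
        ROW_NUMBER() OVER (PARTITION BY locouid ORDER BY entrynumber DESC) AS SortId
    FROM locoreports
    WHERE date > DATE_SUB(NOW(), INTERVAL 30 DAY)
) AS t2
    ON t2.locouid = t1.uid
    AND t2.SortId = 1
LEFT JOIN tsarreports AS t3
    ON t3.entrynumber = t2.tsarReportEntryNum
WHERE t1.railroad IN ('BNSF','BN','CBQ','GN','MILW')
ORDER BY t1.number ASC;
```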
The `ROW_NUMBER()` window function is awesomely helpful for getting the max / min / top N rows of a grouping (`PARTITION`) however you want to order it, and it's generally pretty efficient at doing so. `ROW_NUMBER()` will generate a unique ID for each row within the `PARTITION` (grouping) in the order you specify, in this case on `entrynumber` descending. Then we can subsequently filter on only the first (latest) `entrynumber` in the `JOIN` predicates to the subquery with `AND t2.SortId = 1`. (Side note: you can change the 1 to any number N, and that's the number of latest rows by `t2.entrynumber` descending you'd get back, but in your specific example you only want the very latest.)

This allowed us to remove the subquery on `locoreports` from your `WHERE` clause, so effectively we only need to process and reduce the data from `locoreports` once. I also moved the filter on the `date` up into the subquery. I wrote it as a subquery in the above example, though I normally like to refactor that stuff into a CTE just for readability; I'm unsure of your MySQL version, which may or may not support CTEs. There may be other potential performance improvements you can try as well, but the above two things are what jumped out at me.
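For reference, a sketch of the same query refactored into a CTE; this assumes MySQL 8.0 or later, where both CTEs and window functions are available (the `LatestReports` name is just illustrative):

```sql
WITH LatestReports AS
(
    -- Latest report per locomotive within the last 30 days: SortId = 1
    SELECT locouid, entrynumber, date, yard, yardmaster, trainsymbol, tsarReportEntryNum,
        ROW_NUMBER() OVER (PARTITION BY locouid ORDER BY entrynumber DESC) AS SortId
    FROM locoreports
    WHERE date > DATE_SUB(NOW(), INTERVAL 30 DAY)
)
SELECT t1.railroad, t1.number, t1.typeid, t2.date, t2.yard, t2.yardmaster, t2.trainsymbol, t3.trainuid
FROM locodata AS t1
INNER JOIN LatestReports AS t2
    ON t2.locouid = t1.uid
    AND t2.SortId = 1
LEFT JOIN tsarreports AS t3
    ON t3.entrynumber = t2.tsarReportEntryNum
WHERE t1.railroad IN ('BNSF','BN','CBQ','GN','MILW')
ORDER BY t1.number ASC;
```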