Problem
I have some assets in my application which are updated from time to time in an asynchronous fashion.
The example I'm gonna use here is Vehicles
. There are two tables:
Vehicles
: holds information about the vehicles themselvesVehicleUpdates
: holds information about all updates that happened for that vehicle.
The relevant parts of the table structure are:
CREATE TABLE `Vehicles` (
`id` varchar(50) NOT NULL,
`organizationId` varchar(50) NOT NULL,
`plate` char(7) NOT NULL,
`vehicleInfo` json DEFAULT NULL,
`createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `unq_Vehicles_orgId_plate_idx` (`organizationId`,`plate`) USING BTREE,
KEY `Vehicles_createdAt_idx` (`createdAt`),
);
CREATE TABLE `VehicleUpdates` (
`id` varchar(50) NOT NULL,
`organizationId` varchar(50) NOT NULL,
`vehiclePlate` char(7) NOT NULL,
`status` varchar(15) NOT NULL,
`createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `VehicleUpdates_orgId_vhclPlt_createdAt_idx` (`organizationId`,`vehiclePlate`,`createdAt`) USING BTREE
);
Now I have a new requirement in which I must return the latest update information along side the vehicle information itself.
Groupwise MAX Solutions
After digging a little, I've found this blog article. I then tried to use the suggested "uncorrelated subquery" approach, since it's deemed the best one:
Uncorrelated subquery
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
GROUP BY organizationId, vehiclePlate
) AS vu2 USING (organizationId, vehiclePlate, createdAt);
This query has an average execution time of 275 ms
in my production database.
I thought that was too slow, so I decided to give the "LEFT JOIN" approach a shot:
The Duds: LEFT JOIN
SELECT vu1.*
FROM VehicleUpdates AS vu1
LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL;
This one performed way better, with an average execution time of 40 ms
. Good enough for me.
Then I needed to run this query as part of the query on the Vehicles
table.
Current results
The following query would satisfy my requirements:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1
ON v.plate = vu1.vehiclePlate
AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2
ON vu1.organizationId = vu2.organizationId
AND vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL;
The problem is that it takes 20 s
(!) to run. Huge problem!
But I never do a full table scan on production. The query is always limited to a single organizationId
and it's paged, so I return at most 100 lines per page, so I ran the following query:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1
ON v.plate = vu1.vehiclePlate
AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2
ON vu1.organizationId = vu2.organizationId
AND vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
and v.organizationId = '<some organization ID>'
LIMIT 100;
Now it takes from 750 ms
to 11 s
to run, depending on how many vehicles are associated with. Not good enough.
Running explain
for the query above got me:
"select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
SIMPLE | v | ref | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx | "202" | const | 30 | 100 |
SIMPLE | vu1 | ALL | | | | | 263171 | 100 | Using where; Using join buffer (Block Nested Loop)
SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173" | vu1.organizationId,vu1.vehiclePlate | 10 | 10 | Using where; Not exists; Using index
What strikes me is that the vu1
table is running a full table scan, even though the left-most table Vehicles
is being filtered using an indexed column organizationId
, which is also indexed in VehicleUpdates
.
So I decided to give the "uncorrelated subquery" another try and ran:
SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
GROUP BY organizationId, vehiclePlate
) AS vu2 USING (organizationId, vehiclePlate, createdAt)
) AS vu
ON vu.organizationId = v.organizationId
AND vu.vehiclePlate = v.plate
WHERE v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;
This time the execution time varies from 1.4 s
to 13 s
, depending on how many entries there was in the Vehicles
table for a given organizationId
. Unacceptable for my application.
Running explain
got me:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| PRIMARY | v | ALL | | | | | 14456 | 100 |
| PRIMARY | <derived3> | ALL | | | | | 29289 | 100 | Using where
| PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327" | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where
| DERIVED | VehicleUpdates | range | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323" | | 29289 | 100 | Using index for group-by
Current Results – UPDATED
I noticed that adding the specific organizationId
clause can increase the performance.
LEFT JOIN
Running:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1
ON v.plate = vu1.vehiclePlate
AND v.organizationId = vu1.organizationId
AND vu1.organizationId = '<SOME ORGANIZATION ID>' -- <--------
LEFT JOIN VehicleUpdates AS vu2
ON vu1.organizationId = vu2.organizationId
AND vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
and v.organizationId = '<SOME ORGANIZATION ID>' -- <-----------
LIMIT 100;
I get execution times varying from 65 ms
(acceptable) to 2.5 s
(not acceptable).
Uncorrelated query
Putting a organizationId = '<SOME ORGANIZATION ID>'
clause in the "main" query and the join external subquery:
SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
GROUP BY organizationId, vehiclePlate
) AS vu2 ON vu1.organizationId = vu2.organizationId
and vu1.vehiclePlate = vu2.vehiclePlate
and vu1.createdAt = vu2.createdAt
WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
) AS vu
ON vu.organizationId = v.organizationId
AND vu.vehiclePlate = v.plate
where
v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;
I get execution times varying from 450 ms
(not acceptable) to 900 ms
(not acceptable).
Putting a organizationId = '<SOME ORGANIZATION ID>'
clause in the "main" query and the join internal subquery:
SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
GROUP BY organizationId, vehiclePlate
) AS vu2 ON vu1.organizationId = vu2.organizationId
and vu1.vehiclePlate = vu2.vehiclePlate
and vu1.createdAt = vu2.createdAt
) AS vu
ON vu.organizationId = v.organizationId
AND vu.vehiclePlate = v.plate
where
v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;
I get execution times varying from 225 ms
(acceptable) to 500 ms
(not acceptable).
Are there any better ways of handling such query?
Database Information
- MySQL
- Version: 5.7.23-log (Amazon RDS)
- Engine: InnoDB
Best Answer
You are definitely on the right track with putting the
organizationId
into strategic parts of your query.Try this slightly modified version of your "uncorrelated subquery" query:
This version of the query avoids the intermediate
SELECT *
on the LEFT JOINed table, and so not have to lose speed with an intermediate result.The only difference between my query and yours is I dropped the
SELECT vu1.* FROM
andAS vu
from the query, and the updated all the occurrences ofvu
tovu1
.It is useful little trick that help out in situations where you need to use an INNER JOIN on a table that needs to be LEFT JOINed to the main query, which is exactly your case.