MySQL Performance – Resolving Issues with Groupwise MAX in JOIN Clause

indexinnodbMySQLmysql-5.7

Problem

I have some assets in my application which are updated from time to time in an asynchronous fashion.

The example I'm gonna use here is Vehicles. There are two tables:

  • Vehicles: holds information about the vehicles themselves
  • VehicleUpdates: holds information about all updates that happened for that vehicle.

The relevant parts of the table structure are:

CREATE TABLE `Vehicles` (
  `id` varchar(50) NOT NULL,
  `organizationId` varchar(50) NOT NULL,
  `plate` char(7) NOT NULL,
  `vehicleInfo` json DEFAULT NULL,
  `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unq_Vehicles_orgId_plate_idx` (`organizationId`,`plate`) USING BTREE,
  KEY `Vehicles_createdAt_idx` (`createdAt`),
);

CREATE TABLE `VehicleUpdates` (
  `id` varchar(50) NOT NULL,
  `organizationId` varchar(50) NOT NULL,
  `vehiclePlate` char(7) NOT NULL,
  `status` varchar(15) NOT NULL,
  `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `VehicleUpdates_orgId_vhclPlt_createdAt_idx` (`organizationId`,`vehiclePlate`,`createdAt`) USING BTREE
);

Now I have a new requirement in which I must return the latest update information along side the vehicle information itself.

Groupwise MAX Solutions

After digging a little, I've found this blog article. I then tried to use the suggested "uncorrelated subquery" approach, since it's deemed the best one:

Uncorrelated subquery

SELECT  vu1.*
    FROM VehicleUpdates AS vu1
    JOIN
      ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
            FROM  VehicleUpdates
            GROUP BY organizationId, vehiclePlate
      ) AS vu2 USING (organizationId, vehiclePlate, createdAt);

This query has an average execution time of 275 ms in my production database.

I thought that was too slow, so I decided to give the "LEFT JOIN" approach a shot:

The Duds: LEFT JOIN

SELECT  vu1.*
    FROM  VehicleUpdates AS vu1
    LEFT JOIN  VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate
      AND  vu2.createdAt > vu1.createdAt
    WHERE  vu2.id IS NULL;

This one performed way better, with an average execution time of 40 ms. Good enough for me.

Then I needed to run this query as part of the query on the Vehicles table.

Current results

The following query would satisfy my requirements:

SELECT  v.*, vu1.*
FROM  Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
    ON v.plate = vu1.vehiclePlate 
        AND v.organizationId = vu1.organizationId
LEFT JOIN  VehicleUpdates AS vu2 
    ON vu1.organizationId = vu2.organizationId 
        AND vu1.vehiclePlate = vu2.vehiclePlate
        AND  vu2.createdAt > vu1.createdAt
WHERE  vu2.id IS NULL;

The problem is that it takes 20 s (!) to run. Huge problem!

But I never do a full table scan on production. The query is always limited to a single organizationId and it's paged, so I return at most 100 lines per page, so I ran the following query:

SELECT  v.*, vu1.*
FROM  Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
    ON v.plate = vu1.vehiclePlate 
        AND v.organizationId = vu1.organizationId
LEFT JOIN  VehicleUpdates AS vu2 
    ON vu1.organizationId = vu2.organizationId 
        AND vu1.vehiclePlate = vu2.vehiclePlate
        AND  vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
    and v.organizationId = '<some organization ID>'
LIMIT 100;

Now it takes from 750 ms to 11 s to run, depending on how many vehicles are associated with. Not good enough.

Running explain for the query above got me:

"select_type" | "table" | "type" | "possible_keys"                                        | "key"                                      | "key_len" | "ref"                               | "rows" | "filtered" | "Extra"
SIMPLE        | v       | ref    | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx               | "202"     | const                               | 30     | 100        |
SIMPLE        | vu1     | ALL    |                                                        |                                            |           |                                     | 263171 | 100        | Using where; Using join buffer (Block Nested Loop)
SIMPLE        | vu2     | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx             | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173"     | vu1.organizationId,vu1.vehiclePlate | 10     | 10         | Using where; Not exists; Using index

What strikes me is that the vu1 table is running a full table scan, even though the left-most table Vehicles is being filtered using an indexed column organizationId, which is also indexed in VehicleUpdates.

So I decided to give the "uncorrelated subquery" another try and ran:

SELECT  v.*, vu.*
FROM  Vehicles AS v
LEFT JOIN (
    SELECT  vu1.*
        FROM VehicleUpdates AS vu1
        JOIN
          ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
                FROM  VehicleUpdates
                GROUP BY organizationId, vehiclePlate
          ) AS vu2 USING (organizationId, vehiclePlate, createdAt)
) AS vu 
    ON vu.organizationId = v.organizationId 
        AND vu.vehiclePlate = v.plate
WHERE v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;

This time the execution time varies from 1.4 s to 13 s, depending on how many entries there was in the Vehicles table for a given organizationId. Unacceptable for my application.

Running explain got me:

| "select_type" | "table"        | "type" | "possible_keys"                            | "key"                                      | "key_len" | "ref"                                             | "rows" | "filtered" | "Extra"
| PRIMARY       | v              | ALL    |                                            |                                            |           |                                                   | 14456  | 100        |
| PRIMARY       | <derived3>     | ALL    |                                            |                                            |           |                                                   | 29289  | 100        | Using where
| PRIMARY       | vu1            | ref    | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327"     | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1      | 100        | Using where
| DERIVED       | VehicleUpdates | range  | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323"     |                                                   | 29289  | 100        | Using index for group-by

Current Results – UPDATED

I noticed that adding the specific organizationId clause can increase the performance.

LEFT JOIN

Running:

SELECT  v.*, vu1.*
FROM  Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
    ON v.plate = vu1.vehiclePlate 
        AND v.organizationId = vu1.organizationId
        AND vu1.organizationId = '<SOME ORGANIZATION ID>' -- <--------
LEFT JOIN  VehicleUpdates AS vu2 
    ON vu1.organizationId = vu2.organizationId 
        AND vu1.vehiclePlate = vu2.vehiclePlate
        AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
    and v.organizationId = '<SOME ORGANIZATION ID>' -- <-----------
LIMIT 100;

I get execution times varying from 65 ms (acceptable) to 2.5 s (not acceptable).

Uncorrelated query

Putting a organizationId = '<SOME ORGANIZATION ID>' clause in the "main" query and the join external subquery:

SELECT  v.*, vu.*
FROM  Vehicles AS v
LEFT JOIN (
    SELECT  vu1.*
        FROM VehicleUpdates AS vu1
        JOIN
          ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
                FROM  VehicleUpdates
                GROUP BY organizationId, vehiclePlate
          ) AS vu2 ON vu1.organizationId = vu2.organizationId
                and vu1.vehiclePlate = vu2.vehiclePlate
                and vu1.createdAt = vu2.createdAt
        WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
    ) AS vu 
        ON vu.organizationId = v.organizationId 
            AND vu.vehiclePlate = v.plate
where
    v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;

I get execution times varying from 450 ms (not acceptable) to 900 ms (not acceptable).

Putting a organizationId = '<SOME ORGANIZATION ID>' clause in the "main" query and the join internal subquery:

SELECT  v.*, vu.*
FROM  Vehicles AS v
LEFT JOIN (
    SELECT  vu1.*
        FROM VehicleUpdates AS vu1
        JOIN
          ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
                FROM  VehicleUpdates
                WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
                GROUP BY organizationId, vehiclePlate
          ) AS vu2 ON vu1.organizationId = vu2.organizationId
                and vu1.vehiclePlate = vu2.vehiclePlate
                and vu1.createdAt = vu2.createdAt
    ) AS vu 
        ON vu.organizationId = v.organizationId 
            AND vu.vehiclePlate = v.plate
where
    v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;

I get execution times varying from 225 ms (acceptable) to 500 ms (not acceptable).


Are there any better ways of handling such query?

Database Information

  • MySQL
  • Version: 5.7.23-log (Amazon RDS)
  • Engine: InnoDB

Best Answer

You are definitely on the right track with putting the organizationId into strategic parts of your query.

Try this slightly modified version of your "uncorrelated subquery" query:

SELECT  v.*, vu1.*
FROM  Vehicles AS v
LEFT JOIN (
        VehicleUpdates AS vu1
        JOIN
          ( SELECT  vehiclePlate, organizationId, MAX(createdAt) AS createdAt
                FROM  VehicleUpdates
                WHERE organizationId = 'A100'
                GROUP BY organizationId, vehiclePlate
          ) AS vu2 ON vu1.organizationId = vu2.organizationId
                and vu1.vehiclePlate = vu2.vehiclePlate
                and vu1.createdAt = vu2.createdAt
    )
        ON vu1.organizationId = v.organizationId 
            AND vu1.vehiclePlate = v.plate
where
    v.organizationId = 'A100'
LIMIT 100;

This version of the query avoids the intermediate SELECT * on the LEFT JOINed table, and so not have to lose speed with an intermediate result.

The only difference between my query and yours is I dropped the SELECT vu1.* FROM and AS vu from the query, and the updated all the occurrences of vu to vu1.

It is useful little trick that help out in situations where you need to use an INNER JOIN on a table that needs to be LEFT JOINed to the main query, which is exactly your case.