Mysql – Get top row in multiple dimensions

greatest-n-per-groupinnodbMySQL

I have a table of jobs that has columns for unique id, app, env, the status of the job and the timestamp for it, along with other data not relevant to the question. Like so,

CREATE TABLE the_table (
    id       int           AUTO_INCREMENT,
    app      varchar(255),
    env      varchar(255),
    time_end datetime(3),
    status   varchar(255),
    PRIMARY KEY (id)
);

Each job belongs to an app in en environment, env, and the values for these are limited, currently around ten for each of app and env. Some example data (with changed values):

INSERT INTO the_table(app,env,time_end,status)
VALUES
    ( 'app1', 'env1', '2017-11-14 14:45:07.037', 'finished' ),
    ( 'app1', 'env2', '2017-08-01 08:30:03.951', 'finished' ),
    ( 'app1', 'env1', '2017-10-05 21:36:22.716', 'finished' ),
    ( 'app2', 'env2', '2017-05-18 11:26:58.122', 'finished' ),
    ( 'app1', 'env1', '2017-09-26 00:35:37.914', 'finished' ),
    ( 'app1', 'env3', '2017-11-05 20:27:00.762', 'finished' ),
    ( 'app1', 'env3', '2017-06-04 16:55:19.089', 'finished' ),
    ( 'app3', 'env3', '2017-10-05 04:36:00.820', 'finished' ),
    ( 'app3', 'env3', '2017-09-22 04:11:51.602', 'finished' ),
    ( 'app2', 'env1', '2017-10-09 15:36:38.642', 'finished' ),
    ( 'app1', 'env2', '2017-10-04 16:07:48.272', 'finished' ),
    ( 'app1', 'env1', '2017-09-16 15:35:22.239', 'finished' ),
    ( 'app1', 'env1', '2017-10-12 19:53:29.434', 'finished' ),
    ( 'app1', 'env1', '2017-08-23 07:07:56.983', 'finished' );

I want to get, for each combination of app and env, and status=finished, the latest record. My first solution would be something like:

xs = Query("SELECT DISTINCT app FROM the_table")
ys = Query("SELECT DISTINCT env FROM the_table")


for app in apps do{
  for env in envs do{
    r = Query("SELECT * FROM the_table WHERE status='finished' AND the_table.env=env AND the_table.env=env ORDER BY time_end desc LIMIT 1")
  }
}

This could potentially be a lot of queries though and I would like to find another solution more idiomatic to SQL. This is using MySQL 5.7, so I can't use limit in subqueries.

Best Answer

If you wanted only the last time_end value per (app, env), you could just use GROUP BY:

SELECT
  app,
  env,
  MAX(time_end) AS last_time_end
FROM
  the_table
WHERE
  status = 'finished'
GROUP BY
  app,
  env
;

However, you want in each case the entire row where time_end is last. That is a classic "greatest N per group" problem. One way to solve it is in two steps: first get the last values themselves, then join them back to the original table and get the corresponding complete rows. The query above, therefore, is the first step. Use it as a derived table and join it back to the_table to get the final output:

SELECT
  t.*
FROM
  the_table AS t
  INNER JOIN
  (
    SELECT
      app,
      env,
      MAX(time_end) AS last_time_end
    FROM
      the_table
    WHERE
      status = 'finished'
    GROUP BY
      app,
      env
  ) AS s ON t.app = s.app
        AND t.env = s.env
        AND t.time_end = s.last_time_end
;

Other database products support more advanced SQL syntax – specifically, window functions – that allows you to solve this category of problems without using a join, thus offering better performance. MySQL does not support window functions (yet?) but there is a way to work around that limitation using variables.

Specifically, you would use variables to rank rows within each (app, env) group based on the time_end value and then filter the rows on the ranking value:

SELECT
  derived.*
FROM
  (
    SELECT
      @rank := (app = @app AND env = @env) * @rank + 1 AS rank,
      ID,
      @app := app AS app,
      @env := env AS env,
      time_end
    FROM
      (SELECT @app := '', @env := '', @rank := 0) AS x,
      the_table
    WHERE
      status = 'finished'
    ORDER BY
      app ASC,
      env ASC,
      time_end DESC
  ) AS derived
WHERE
  derived.rank = 1
;

The query works like this:

  1. The rows are sorted in such a way that identical (app, env) rows are arranged in groups, the last time_end row being the first row in each group.

  2. Each row is then assigned a ranking value with the help of variables. Variables @app and @env hold the previous row's values of app and env and take part in calculating the @rank value: if the current row's app and env match the previous row's values, the @rank value is increased by 1, otherwise it is set to 1.

    Thus, the first row in each group always gets the value of 1, because the previous row belongs to a different group or, in the case of the very first group, is simply non-existent (@app and @env in that case each hold the initial value, an empty string).

  3. The results are used as a derived table and are filtered on the ranking value: only the rows with the ranking value of 1 are returned (they hold the last time_end value in each group).

My only issue with this method is that it is very MySQL-specific, which, in my opinion, is worth keeping in mind. No other RDBMS supports this syntax (unless that product is a MySQL fork, like MariaDB, for instance).

Both methods shown can be found and played with in this live demo at Rextester.