Mysql – Get top row in multiple dimensions

greatest-n-per-groupinnodbMySQL

I have a table of jobs that has columns for unique id, app, env, the status of the job and the timestamp for it, along with other data not relevant to the question. Like so,

CREATE TABLE the_table (
    id       int           AUTO_INCREMENT,
    app      varchar(255),
    env      varchar(255),
    time_end datetime(3),
    status   varchar(255),
    PRIMARY KEY (id)
);

Each job belongs to an app in en environment, env, and the values for these are limited, currently around ten for each of app and env. Some example data (with changed values):

INSERT INTO the_table(app,env,time_end,status)
VALUES
    ( 'app1', 'env1', '2017-11-14 14:45:07.037', 'finished' ),
    ( 'app1', 'env2', '2017-08-01 08:30:03.951', 'finished' ),
    ( 'app1', 'env1', '2017-10-05 21:36:22.716', 'finished' ),
    ( 'app2', 'env2', '2017-05-18 11:26:58.122', 'finished' ),
    ( 'app1', 'env1', '2017-09-26 00:35:37.914', 'finished' ),
    ( 'app1', 'env3', '2017-11-05 20:27:00.762', 'finished' ),
    ( 'app1', 'env3', '2017-06-04 16:55:19.089', 'finished' ),
    ( 'app3', 'env3', '2017-10-05 04:36:00.820', 'finished' ),
    ( 'app3', 'env3', '2017-09-22 04:11:51.602', 'finished' ),
    ( 'app2', 'env1', '2017-10-09 15:36:38.642', 'finished' ),
    ( 'app1', 'env2', '2017-10-04 16:07:48.272', 'finished' ),
    ( 'app1', 'env1', '2017-09-16 15:35:22.239', 'finished' ),
    ( 'app1', 'env1', '2017-10-12 19:53:29.434', 'finished' ),
    ( 'app1', 'env1', '2017-08-23 07:07:56.983', 'finished' );

I want to get, for each combination of app and env, and status=finished, the latest record. My first solution would be something like:

xs = Query("SELECT DISTINCT app FROM the_table")
ys = Query("SELECT DISTINCT env FROM the_table")


for app in apps do{
  for env in envs do{
    r = Query("SELECT * FROM the_table WHERE status='finished' AND the_table.env=env AND the_table.env=env ORDER BY time_end desc LIMIT 1")
  }
}

This could potentially be a lot of queries though and I would like to find another solution more idiomatic to SQL. This is using MySQL 5.7, so I can't use limit in subqueries.

Best Answer

If you wanted only the last time_end value per (app, env), you could just use GROUP BY:

SELECT
  app,
  env,
  MAX(time_end) AS last_time_end
FROM
  the_table
WHERE
  status = 'finished'
GROUP BY
  app,
  env
;

However, you want in each case the entire row where time_end is last. That is a classic "greatest N per group" problem. One way to solve it is in two steps: first get the last values themselves, then join them back to the original table and get the corresponding complete rows. The query above, therefore, is the first step. Use it as a derived table and join it back to the_table to get the final output:

SELECT
  t.*
FROM
  the_table AS t
  INNER JOIN
  (
    SELECT
      app,
      env,
      MAX(time_end) AS last_time_end
    FROM
      the_table
    WHERE
      status = 'finished'
    GROUP BY
      app,
      env
  ) AS s ON t.app = s.app
        AND t.env = s.env
        AND t.time_end = s.last_time_end
;

Other database products support more advanced SQL syntax – specifically, window functions – that allows you to solve this category of problems without using a join, thus offering better performance. MySQL does not support window functions (yet?) but there is a way to work around that limitation using variables.

Specifically, you would use variables to rank rows within each (app, env) group based on the time_end value and then filter the rows on the ranking value:

SELECT
  derived.*
FROM
  (
    SELECT
      @rank := (app = @app AND env = @env) * @rank + 1 AS rank,
      ID,
      @app := app AS app,
      @env := env AS env,
      time_end
    FROM
      (SELECT @app := '', @env := '', @rank := 0) AS x,
      the_table
    WHERE
      status = 'finished'
    ORDER BY
      app ASC,
      env ASC,
      time_end DESC
  ) AS derived
WHERE
  derived.rank = 1
;

The query works like this:

The rows are sorted in such a way that identical (app, env) rows are arranged in groups, the last time_end row being the first row in each group.
Each row is then assigned a ranking value with the help of variables. Variables @app and @env hold the previous row's values of app and env and take part in calculating the @rank value: if the current row's app and env match the previous row's values, the @rank value is increased by 1, otherwise it is set to 1.

Thus, the first row in each group always gets the value of 1, because the previous row belongs to a different group or, in the case of the very first group, is simply non-existent (@app and @env in that case each hold the initial value, an empty string).
The results are used as a derived table and are filtered on the ranking value: only the rows with the ranking value of 1 are returned (they hold the last time_end value in each group).

My only issue with this method is that it is very MySQL-specific, which, in my opinion, is worth keeping in mind. No other RDBMS supports this syntax (unless that product is a MySQL fork, like MariaDB, for instance).

Both methods shown can be found and played with in this live demo at Rextester.

Related Solutions

Mysql int vs varchar as primary key (InnoDB Storage Engine

I follow the following rules for primary keys:

a) Should not have any business meaning - they should be totally independent of the application you are developing, therefore I go for numeric auto generated integers. However if you need additional columns to be unique then create unique indexes to support that

b) Should perform in joins - joining to varchars vs integers is about 2x to 3x slower as the length of the primary key grows, so you want to have your keys as integers. Since all computer systems are binary, I suspect its coz the string is changed to binary then compared with the others which is very slow

c) Use the smallest data type possible - if you expect your table to have very few columns say 52 US states, then use the smallest type possible maybe a CHAR(2) for the 2 digit code, but I would still go for a tinyint (128) for the column vs a big int which can go up to 2billion

Also you will have a challenge with cascading your changes from the primary keys to the other tables if for example the project name changes (which is not uncommon)

Go for sequential auto incrementing integers for your primary keys and gain the inbuilt efficiencies that database systems provide with support for changes in the future

MySQL: select a row if that row is equal for every user

If there are 5 rows per month (1 for every user), this will do:

SELECT m.year, m.month 
FROM month AS m
GROUP BY m.year, m.month
HAVING SUM(finished) = 5 ;

Best Answer

Related Solutions

Mysql int vs varchar as primary key (InnoDB Storage Engine

MySQL: select a row if that row is equal for every user

Related Question