MySQL – Data for the Latest Previous Effective Date Subquery

Tags: mariadb, mysql

Imagine issues with two kinds of linked records:

  • Time entries to track the time spent
  • Journal entries to track the evolution of progress

(This is actually part of the Redmine schema.)

Let's take some sample data:

Issues

id  subject       estimated_hours
1   First issue   4
2   Second issue  8

Time entries

id  issue_id  spent_on    hours
1   1         2020-12-02  1
2   1         2020-12-05  1.5
3   2         2020-12-06  0.5
4   1         2020-12-10  1
5   2         2020-12-10  1
6   2         2020-12-15  2
7   2         2020-12-15  1

Journal

id  issue_id  updated_on  done_ratio
1   1         2020-11-20  0
2   2         2020-11-20  5
3   1         2020-12-01  20
4   1         2020-12-03  40
5   2         2020-12-06  30
6   1         2020-12-07  70
7   1         2020-12-11  100
8   2         2020-12-12  50

Now, for each time entry, I need to know the latest done ratio.
Here is what I'm expecting:

id  subject       spent_on    hours  done_ratio
1   First issue   2020-12-02  1      20
2   First issue   2020-12-05  1.5    40
3   Second issue  2020-12-06  0.5    30
4   First issue   2020-12-10  1      70
5   Second issue  2020-12-10  1      30
6   Second issue  2020-12-15  2      50
7   Second issue  2020-12-15  1      50

I thought I could get that with this query:

SELECT
  tie.id
, iss.subject
, tie.spent_on
, tie.hours
, jou.done_ratio
FROM time_entry AS tie
JOIN issue AS iss ON (tie.issue_id = iss.id)
JOIN (
  SELECT
    issue_id
  , updated_on
  , done_ratio
  FROM journal
  ORDER BY updated_on DESC
) AS jou ON (
  jou.issue_id = tie.issue_id
  AND jou.updated_on <= tie.spent_on
)
GROUP BY tie.id, tie.spent_on
ORDER BY tie.spent_on

But it's returning the first done ratio for every time entry.

Here is a playground to fiddle with the mentioned data

How can I get the done ratio of the latest journal entry prior to (<=) each time entry?

I'm using mariadb:10.2.9 here, but this should be pretty much standard ANSI SQL.

Best Answer

To solve your problem, I did the following:

Took your fiddle and put it into dbfiddle.uk - the best one around, with the most servers! I used MariaDB 10.3 (10.2 isn't available on dbfiddle), and I also used PostgreSQL - the most standards-compliant F/LOSS server around (possibly along with Firebird). MySQL and its derivatives are not great in this regard - I presume that fixing some of the bugs/inconsistencies would break a lot of existing systems (see the discussion below on ONLY_FULL_GROUP_BY).

The MariaDB fiddle is here and the PostgreSQL one is here if you're interested.

I'll show some SQL that demonstrates my train of thought and then the final (correct) result. I'm a big fan of Window functions in SQL and I thought that I could use them to answer this question. They are really powerful and well worth getting to know!

So, I was attempting SQL like this:

SELECT 
  i.subject AS i_sub, 
  te.id AS te_id, te.issue_id AS te_is, te.spent_on AS te_so, te.hours AS te_hr, 
  -- ROW_NUMBER() OVER (PARTITION BY te.spent_on ORDER BY te.spent_on, j.updated_on, te.id, j.id) AS rn_je,
  j.id AS j_id, j.issue_id AS j_iss_id, j.updated_on AS j_upd, j.done_ratio AS j_dr,
  -- ROW_NUMBER() OVER (PARTITION BY j.updated_on ORDER BY j.updated_on ASC, j.done_ratio) AS rn_j,
  ROW_NUMBER() OVER (PARTITION BY te.id ORDER BY te.id, j.done_ratio DESC) AS rn_te_id,
  ROW_NUMBER() OVER (PARTITION BY j.id ORDER BY j.id) AS rn_j_id,
  ROW_NUMBER() OVER (PARTITION BY j.done_ratio ORDER BY j.done_ratio) AS rn_j_dr,
  ROW_NUMBER() OVER (PARTITION BY te.spent_on ORDER BY te.spent_on) AS rn_te_so,
  ROW_NUMBER() OVER (PARTITION BY j.updated_on ORDER BY te.spent_on) AS rn_j_so
FROM time_entry te
JOIN journal j ON te.issue_id = j.issue_id 
JOIN issue i ON te.issue_id = i.id
WHERE j.updated_on <= te.spent_on
ORDER BY te.spent_on, j.updated_on, te.id, j.id;

Result (better viewed in the fiddle):

i_sub         te_id  te_is  te_so       te_hr  j_id  j_iss_id  j_upd       j_dr  rn_te_id  rn_j_id  rn_j_dr  rn_te_so  rn_j_so
First issue   1      1      2020-12-02  1      3     1         2020-12-01  20    1         1        1        1         1
First issue   2      1      2020-12-05  1.5    3     1         2020-12-01  20    2         3        3        2         2
First issue   2      1      2020-12-05  1.5    4     1         2020-12-03  40    1         2        2        1         1
Second issue  3      2      2020-12-06  0.5    5     2         2020-12-06  30    1         3        3        1         1
First issue   4      1      2020-12-10  1      3     1         2020-12-01  20    3         2        2        4         3
First issue   4      1      2020-12-10  1      4     1         2020-12-03  40    2         1        1        1         2
Second issue  5      2      2020-12-10  1      5     2         2020-12-06  30    1         2        2        2         2
First issue   4      1      2020-12-10  1      6     1         2020-12-07  70    1         1        1        3         1
Second issue  6      2      2020-12-15  2      5     2         2020-12-06  30    2         1        1        1         3
Second issue  7      2      2020-12-15  1      5     2         2020-12-06  30    2         4        4        3         4
Second issue  6      2      2020-12-15  2      8     2         2020-12-12  50    1         1        1        2         1
Second issue  7      2      2020-12-15  1      8     2         2020-12-12  50    1         2        2        4         2

As you can see, it took considerable experimentation (not to mention most of the night, much to my fiancée's displeasure...) to find a unique combination of fields using ROW_NUMBER(), but I eventually found it - using

ROW_NUMBER() OVER (PARTITION BY te.id ORDER BY te.id, j.done_ratio DESC)

does the trick (note that ordering by j.done_ratio DESC relies on done_ratio never decreasing over time, which holds for this data; ordering by j.updated_on DESC would be the more robust choice) as follows:

SELECT 
  ROW_NUMBER() OVER () AS "Id", i_sub AS "Subject", te_so AS "Spent on", te_hr AS "Hours",
  j_dr AS "Done ratio" 
FROM
(
  SELECT 
    i.subject AS i_sub, 
    te.id AS te_id, te.issue_id AS te_is, te.spent_on AS te_so, te.hours AS te_hr, 
    j.id AS j_id, j.issue_id AS j_iss_id, j.updated_on AS j_upd, j.done_ratio AS j_dr,
    ROW_NUMBER() OVER (PARTITION BY te.id ORDER BY te.id, j.done_ratio DESC) AS rn_te_id
  FROM time_entry te
  JOIN journal j ON te.issue_id = j.issue_id 
  JOIN issue i ON te.issue_id = i.id
  WHERE j.updated_on <= te.spent_on
  ORDER BY te.spent_on, j.updated_on, te.id, j.id 
) AS t
WHERE rn_te_id = 1
ORDER BY te_so, te_id;

Result:

Id  Subject       Spent on    Hours  Done ratio
1   First issue   2020-12-02  1      20
2   First issue   2020-12-05  1.5    40
3   Second issue  2020-12-06  0.5    30
4   First issue   2020-12-10  1      70
5   Second issue  2020-12-10  1      30
6   Second issue  2020-12-15  2      50
7   Second issue  2020-12-15  1      50

which is as requested! QED - et voilà!

Just for kicks, I decided to look at the performance of my own and the other answers to your question. This is best done using MySQL 8 (version >= 8.0.18), which has the new EXPLAIN ANALYZE functionality - a vast improvement over the old EXPLAIN EXTENDED, and much closer to PostgreSQL's functionality now (although PG is still far and away the superior database).

You can check these out for yourself in the fiddle - it's a lot of text! Unfortunately, the elapsed-time-per-query functionality covered in this article isn't returned by dbfiddle - and I don't have a running instance of MySQL at the moment. The plans are rather cryptic unless you're an expert, but a decent rule of thumb is that fewer lines mean better performance!

My own query appears to do rather poorly - but it would take (and this applies to the other solutions also) testing with a realistic dataset (plus with proper PRIMARY and FOREIGN KEYs) to be able to pronounce definitively on the optimal solution.

Your first query fails under MySQL 8:

Expression #5 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'jou.done_ratio' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

also under PostgreSQL:

ERROR: column "iss.subject" must appear in the GROUP BY clause or be used in an aggregate function LINE 6: , iss.subject ^

The ONLY_FULL_GROUP_BY problem in the MySQL error message is well known - by not having it enabled, you are running a very high risk of your query not returning correct results - it may work for small test cases, but then you'll start to obtain (ahem...) anomalous results (i.e. just plain WRONG!). Enable it on all of your systems NOW - it's enabled by default on MySQL 8, but not on any of the MariaDBs on dbfiddle.uk. I cannot stress strongly enough how important it is that you make this change!
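For example, a minimal way to turn it on for the current session (a sketch - assuming you can't yet change the server-wide configuration) is:

```sql
-- Enable ONLY_FULL_GROUP_BY for this session only, keeping the existing modes.
-- For a permanent change, add it to sql_mode in my.cnf / my.ini instead.
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');
```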

Your second query works under version 8, but also produces the biggest plan (the most lines), so it mightn't have optimal performance.

@bbaird's plans are the best (i.e. smallest) - the best one being:

SELECT
  id, subject, spent_on, hours, done_ratio
FROM
  (
    SELECT
      time_entry.id, issue.subject, time_entry.spent_on, time_entry.hours,
      journal.done_ratio,
      ROW_NUMBER() OVER (PARTITION BY time_entry.id ORDER BY journal.updated_on DESC) AS rownum
    FROM
      time_entry time_entry 
    INNER JOIN
      issue issue
        ON issue.id = time_entry.issue_id
    LEFT JOIN
      journal journal
        ON journal.issue_id = time_entry.issue_id
             AND journal.updated_on  <= time_entry.spent_on
  ) x
WHERE
  rownum = 1

which gives the plan:

-> Left hash join (journal.issue_id = issue.id), (journal.updated_on = (select #2))  (cost=2.38 rows=16) (actual time=0.060..0.211 rows=7 loops=1)
    -> Inner hash join (time_entry.issue_id = issue.id)  (cost=2.10 rows=2) (actual time=0.017..0.024 rows=7 loops=1)
        -> Table scan on time_entry  (cost=0.18 rows=7) (actual time=0.004..0.007 rows=7 loops=1)
        -> Hash
            -> Table scan on issue  (cost=0.45 rows=2) (actual time=0.004..0.006 rows=2 loops=1)
    -> Hash
        -> Table scan on journal  (cost=0.53 rows=8) (actual time=0.012..0.015 rows=8 loops=1)

which is very compact relative to the others. However, my remarks about realistic testing and proper indexing still apply!
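For completeness, here is a correlated-subquery sketch (using the table and column names from the question, and relying on updated_on being unique per issue) that avoids window functions altogether and expresses the "latest journal entry on or before each spent_on" requirement directly:

```sql
-- For each time entry, pick the done_ratio of the most recent journal row
-- for the same issue with updated_on <= spent_on (sort DESC, take one row).
SELECT
  te.id, i.subject, te.spent_on, te.hours,
  (SELECT j.done_ratio
   FROM journal j
   WHERE j.issue_id = te.issue_id
     AND j.updated_on <= te.spent_on
   ORDER BY j.updated_on DESC
   LIMIT 1) AS done_ratio
FROM time_entry te
JOIN issue i ON i.id = te.issue_id
ORDER BY te.spent_on, te.id;
```

Like the LEFT JOIN version above, this returns NULL for a time entry with no prior journal row, and it works on MariaDB 10.2 as well.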

+1 for an excellent question that made me think hard (and with a fiddle included!). P.S. Welcome to the forum!