It very much depends on circumstances and exact requirements. Consider my comment.
Simple solution
With DISTINCT ON in Postgres:
SELECT DISTINCT ON (i.good, i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good, i.the_date, p.the_date DESC;
Returned rows are sorted.
Or with NOT EXISTS in standard SQL (works with every RDBMS I know):
SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM inventory i
LEFT JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good = p.good
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
Same result, but with arbitrary sort order - unless you add ORDER BY.
Depending on data distribution, exact requirements and indices, either one of these may be faster. See:
With only few rows per good, DISTINCT ON is typically faster, and you get a sorted result on top of it. But for certain cases other query techniques are (much) faster yet. See below.
Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower, yet. (CTEs improved with Postgres 12.)
Plain views (like proposed by another answer) do not help performance at all in Postgres.
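For illustration, the kind of max-subquery variant referred to here could look like the following sketch against the same tables (not from the original answer):

```sql
-- Typically slower: compute the max pricing date per (good, inventory date)
-- in a grouped subquery, then join back to price to fetch the price itself.
SELECT i.the_date, sub.pricing_date, i.good, i.quantity, p.price
FROM   inventory i
LEFT   JOIN (
   SELECT i2.good, i2.the_date, max(p2.the_date) AS pricing_date
   FROM   inventory i2
   JOIN   price p2 ON p2.good = i2.good AND p2.the_date <= i2.the_date
   GROUP  BY i2.good, i2.the_date
   ) sub ON sub.good = i.good AND sub.the_date = i.the_date
LEFT   JOIN price p ON p.good = i.good AND p.the_date = sub.pricing_date;
```

It scans price twice (once to find the date, once to fetch the price), which is part of why it tends to lose against DISTINCT ON or NOT EXISTS.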
Proper solution
Strings and collation
First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (text, varchar, ...) is done according to the current COLLATION. Typically, your DB would use some local set of rules, like in my case de_AT.UTF-8. Find out with:
SHOW lc_collate;
This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse. If you do not actually care about collation rules in your output (or the sort order), this can be faster with COLLATE "C":
SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good COLLATE "C", i.the_date, p.the_date DESC;
Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').
Index
If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):
CREATE INDEX price_good_date_desc_collate_c_idx
ON price (good COLLATE "C", the_date DESC);
Read the last two chapters of the related answer I linked above.
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
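Such a pair of indexes on the same columns might look like this (index names are made up for the sketch):

```sql
-- One index for fast "C"-collated look-ups / sorts ...
CREATE INDEX price_good_c_idx ON price (good COLLATE "C", the_date DESC);
-- ... and one for queries that need the default collation's sort order.
CREATE INDEX price_good_default_idx ON price (good, the_date DESC);
```

The planner picks whichever index matches the collation used in the query.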
Normalize
Redundant strings (the name of each good) bloat tables and indexes, which makes everything slower. A proper table layout avoids most of the problem. It could look like this:
CREATE TABLE good (
good_id serial PRIMARY KEY
, good text NOT NULL
);
CREATE TABLE inventory (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int NOT NULL
, PRIMARY KEY(good_id, the_date)
);
CREATE TABLE price (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, price numeric NOT NULL
, PRIMARY KEY(good_id, the_date)
);
The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:
CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);
Again, the collation must match your query (see above).
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
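For example, an index that also carries the price column can serve the query without visiting the heap at all (the index name is made up; the INCLUDE clause requires Postgres 11 or later, before that you can simply append the column to the index key):

```sql
-- Postgres 11+: keep price in the index as payload, not as part of the key
CREATE INDEX price_good_date_covering_idx
ON price (good_id, the_date DESC) INCLUDE (price);

-- Pre-11 equivalent: append the column to the index key instead
-- CREATE INDEX price_good_date_covering_idx
-- ON price (good_id, the_date DESC, price);
```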
The resulting queries are much faster:
DISTINCT ON
SELECT DISTINCT ON (i.good_id, i.the_date)
i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER BY i.good_id, i.the_date, p.the_date DESC;
NOT EXISTS
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good_id = p.good_id
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
Faster solutions
If that still is not fast enough, there may be faster solutions.
Recursive CTE / JOIN LATERAL / correlated subquery
Especially for data distributions with many prices per good.
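A LATERAL variant (Postgres 9.3+) along these lines is a sketch based on the normalized schema above, not the original answer's exact query:

```sql
-- For each inventory row, fetch only the single latest applicable price.
-- With many prices per good, this beats scanning all candidate prices.
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   good g USING (good_id)
LEFT   JOIN LATERAL (
   SELECT p1.the_date, p1.price
   FROM   price p1
   WHERE  p1.good_id = i.good_id
   AND    p1.the_date <= i.the_date
   ORDER  BY p1.the_date DESC
   LIMIT  1
   ) p ON true
ORDER  BY g.good, i.the_date;
```

Each lateral look-up is a short descending index scan on price (good_id, the_date DESC), one per inventory row.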
Materialized view
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume, that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as materialized view.
Postgres 9.3+ has built-in support for materialized views. You can easily implement a basic version manually in older versions.
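With built-in materialized views the basic pattern looks like this (the view name is made up; the query is the DISTINCT ON variant from above):

```sql
-- Store the computed result once as a snapshot ...
CREATE MATERIALIZED VIEW inventory_price AS
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good_id, i.quantity, p.price
FROM   inventory i
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

-- ... and recompute it whenever prices or inventories actually change:
REFRESH MATERIALIZED VIEW inventory_price;
```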
There are probably many ways to do this. The first that comes to mind is to use window functions:
SELECT
id, postcode
FROM
( SELECT id, postcode,
ROW_NUMBER() OVER (PARTITION BY id
ORDER BY MAX(date_created) DESC
) AS rn
FROM tablename
GROUP BY id, postcode
) AS t
WHERE
rn <= 5
ORDER BY
id, rn ;
If there are ties (say the 5th, 6th and 7th postcode for an id have the same date_created), only one of them will be in the results; the choice is arbitrary. If you want all the tied postcodes in those cases, use RANK() instead of ROW_NUMBER().
Another option is to use the LATERAL syntax. I'm not sure which will be more efficient; it will probably depend on the distribution of values in the two columns (id and postcode), i.e. how many distinct ids there are in the whole table, how many distinct postcodes per id, and how many rows per (id, postcode) combination.
SELECT
t.id, ti.postcode
FROM
( SELECT DISTINCT id
FROM tablename
) AS t
CROSS JOIN LATERAL
( SELECT tt.postcode,
MAX(tt.date_created) AS date_created
FROM tablename AS tt
WHERE tt.id = t.id
GROUP BY tt.postcode
ORDER BY date_created DESC
LIMIT 5
) AS ti
ORDER BY
t.id, ti.date_created DESC;
Adding an index on (id, postcode, date_created) would be a good idea, too - or on (id, postcode, date_created DESC).
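As DDL, that suggestion would be (the index name is made up):

```sql
CREATE INDEX tablename_id_postcode_date_idx
ON tablename (id, postcode, date_created DESC);
```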
Best Answer
I went to my favourite MySQL "tips and tricks" site (the artful software site), followed the common queries link and looked for the Top N per group section. The great thing about this site is that it tells you how to do stuff in MySQL for all versions - well, going back at least to MySQL 5.5 - and if you're still running that, well...
I came up with the following, adapted from that site. For the DDL and DML to create and populate the test tables, I used those provided by @nbk - kudos to him (and +1).
MySQL allows the use of user variables, which are a godsend when you don't have capabilities such as the ROW_NUMBER() window function, which would have made this query trivial. I would strongly urge you to upgrade to version 8; it has many other goodies - CTEs, CHECK constraints... Anyway, I'll demonstrate the steps, partly to explain them to you, and partly to explain them to myself! :-)
So, we have the items ('A', 'B') ordered by date DESC (most recent first) with the price. Note that 12 lines of that query could be replaced by one ROW_NUMBER() function line! So, now we wrap that in a query, pulling out those results whose my_rank value is <= 2 - which gives us the two most recent dates!
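A user-variable ranking of the kind described can be sketched like this (the table name prices and the exact column names are assumptions, not from the original):

```sql
-- Pre-8.0 MySQL: emulate ROW_NUMBER() with user variables.
-- Assumed table: prices (item, date, price).
SELECT item, date, price
FROM  ( SELECT p.item, p.date, p.price,
               @rn   := IF(@prev = p.item, @rn + 1, 1) AS my_rank,
               @prev := p.item
        FROM prices p
        CROSS JOIN (SELECT @rn := 0, @prev := NULL) vars
        ORDER BY p.item, p.date DESC
      ) ranked
WHERE my_rank <= 2
ORDER BY item, date DESC;
```

The inner derived table numbers rows per item in descending date order; the outer query keeps the two most recent per item. (Relying on user-variable evaluation order is a documented grey area - another reason to move to version 8 and plain ROW_NUMBER().)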
I would suggest that you spend some time browsing the artful software site!