Postgresql – Get rows with most recent date for each different item

greatest-n-per-grouppostgresql

Let's say this is the sample date coming from a join of 2 tables. Database is Postgres 9.6

id  product_id  invoice_id  amount       date
1    PROD1       INV01       2          01-01-2018
2    PROD2       INV02       3          01-01-2018
3    PROD1       INV01       2          05-01-2018
4    PROD1       INV03       1          05-01-2018
5    PROD2       INV02       3          08-01-2018
6    PROD2       INV04       4          08-01-2018

I want to know if it's possible in a optimized way to:

Get all the PRODx with their respective INVx which have the latest date, but per product_id. Please note that records unused from a day may be reported to a new one.
This means:

id  product_id  invoice_id  amount       date
3    PROD1       INV01       2          05-01-2018
4    PROD1       INV03       1          05-01-2018
5    PROD2       INV02       3          08-01-2018
6    PROD2       INV04       4          08-01-2018

Get daily summed amounts for each PRODx but fill the gaps with the previous ones if day does not exist.

This means:

 product_id    amount       date
   PROD1         2          01-01-2018
   PROD2         3          01-01-2018
   PROD1         2          02-01-2018
   PROD2         3          02-01-2018
   PROD1         2          03-01-2018
   PROD2         3          03-01-2018
   PROD1         2          04-01-2018
   PROD2         3          04-01-2018
   PROD1         3          05-01-2018
   PROD2         3          05-01-2018
   PROD1         3          06-01-2018
   PROD2         3          06-01-2018
   PROD1         3          07-01-2018
   PROD2         3          07-01-2018
   PROD1         3          08-01-2018
   PROD2         7          08-01-2018

A few thoughts:

For first question I could obtain the max(date) for each PRODx and the pick for each PRODx the rows that have the date=with max(date) but I was wondering if there's faster way to obtain this given a large number of recors in the database
For the second question, I could generate a series of dates for the interval needed and then use WITH rows As and do the query grouping by product_id and sum by amount and then select for each date the previous values from rows with a limit 1 but that does not sound that optimized either.

Looking forward for any input. Thank you.

Later edit:
Trying to give DISTINCT ON () a try.

If I have distinct on(product_id, invoice_id) then I don't get only the most recent ones for the most recent date. If there were invoice_ids in the the past, beside the latest date, then they will be returned
If I have distinct on (product_id) then it returns from the most recent date, but as normal, only the last rows even if in the last day I have two positions for PROD1.

Basically I need something like 'I need for the most recent date, all the product_ids and their invoice_ids while keeping in mind that a product_id can have multiple invoice_ids'

Later edit 2:

Running a query like for first question seems to be reasonably fast:

select product_id, invoice_id, amount
from mytable inner join myOtherTable on...
             inner join (select max(date) as last_date, product_id 
                         from mytable 
                         group by product_id) sub on mytable.date = 
                         sub.last_date

Best Answer

Skinning Q#1 independently and slightly differently than @ypercube

with cte as (select row_number() over (partition by product_id,
                                       invoice_id 
                                 order by dt desc) as rn,
                    product_id,
                    invoice_id,
                    amount,dt
               from product ) 
select product_id, invoice_id,amount,dt
  from cte
 where rn=1
 order by product_id,invoice_id;

 product_id | invoice_id | amount |     dt     
------------+------------+--------+------------
 PROD1      | INV01      |      2 | 2018-01-05
 PROD1      | INV03      |      1 | 2018-01-05
 PROD2      | INV02      |      3 | 2018-01-08
 PROD2      | INV04      |      4 | 2018-01-08
(4 rows)

For Q#2, you are on the right track, but the SQL will have a cross join(gasp!)

I think a function with a loop/cursor would be more optimized (i'll try that in my next free block of time)

--the cte will give us the real values
with cte as (select product_id, 
                    sum(amount) as amount, 
                    dt
               from product
              group by product_id,dt)
select p.product_id,  
       (select cte.amount --choose the amount
          from cte
         where cte.product_id = p.product_id
           and cte.dt <= d.gdt -- for same day or earlier
         order by cte.dt desc
         limit 1) as finamt,
       d.gdt
from (select generate_series( (select min(dt)
                                 from product), --where clause if some products 
                                                --don't have an amount
                              (select max(dt)
                                 from product),
                              '1 day' 
                            )::date as gdt)  d
cross join --assuming each listed product has an amount on the min date
     (select distinct product_id
        from product) p
left join --since we need to fill the gaps
     cte on ( d.gdt = cte.dt 
             and p.product_id = cte.product_id)
order by d.gdt, p.product_id
;

Simple solution

With DISTINCT ON in Postgres:

SELECT DISTINCT ON (i.good, i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good, i.the_date, p.the_date DESC;

Returned rows are ordered. See:

Select first row in each GROUP BY group?

Or with NOT EXISTS in standard SQL (works with every RDBMS I know):

SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory  i
LEFT   JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE  NOT EXISTS (
   SELECT FROM price p1
   WHERE  p1.good = p.good
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );

Same result, but with arbitrary sort order - unless you add ORDER BY.
Depending on data distribution, exact requirements and indices, either one of these may be faster. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

With only few rows per good, DISTINCT ON is typically faster and you get a sorted result on top of it. But for certain cases other query techniques are (much) faster, yet. See below.

Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower, yet. (CTEs improved with Postgres 12.)

Plain views (like proposed by another answer) do not help performance at all in Postgres.

db<>fiddle here
_{Old sqlfiddle}

Proper solution

Strings and collation

First of all, your table layout is a sub-optimal. It may seem trivial, but normalizing your schema can go a long way.

Sorting by character types (text, varchar, ...) is done according to current COLLATION. Typically, your DB would use some local set of rules, like in my case: de_AT.UTF-8. Find out with:

SHOW lc_collate;

This makes sorting and index look-ups slower. The longer your strings (names of goods) the worse. If you do not actually care for collation rules in your output (or the sort order), this can be faster with COLLATE "C":

SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good COLLATE "C", i.the_date, p.the_date DESC;

Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').

Index

If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):

CREATE INDEX inventory_good_date_desc_collate_c_idx
ON price(good COLLATE "C", the_date DESC);

Read the last two chapters of the related answer I linked above.

You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.

Normalize

Redundant strings (name of good) bloat tables and indexes, which makes everything slower. A proper table layout can avoid most of the problem. Could look like this:

CREATE TABLE good (
  good_id serial PRIMARY KEY
, good    text   NOT NULL
);

CREATE TABLE inventory (
  good_id  int  REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int  NOT NULL
, PRIMARY KEY(good_id, the_date)
);

CREATE TABLE price (
  good_id  int     REFERENCES good (good_id)
, the_date date    NOT NULL
, price    numeric NOT NULL
, PRIMARY KEY(good_id, the_date));

The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:

CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);

Again, the collation must match your query (see above).

Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.

These resulting queries are much faster:

DISTINCT ON

SELECT DISTINCT ON (i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.the_date, p.the_date DESC;

NOT EXISTS

SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
AND    NOT EXISTS (
   SELECT 1 FROM price p1
   WHERE  p1.good_id = p.good_id
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );

db<>fiddle here
_{OLD sqliddle}

Faster solutions

If that still is not fast enough, there may be faster solutions.

Recursive CTE / `JOIN LATERAL` / correlated subquery

Especially for data distributions with many prices per good:

Optimize GROUP BY query to retrieve latest record per user

Materialized view

If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume, that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as materialized view.

Postgres 9.3+ has automated support for materialized views. You can easily implement a basic version in older versions.

Mysql – Select ONE most recent post for each author

If you aim is to have queries with maximum efficiency, none of the above queries is really the best. Not always at least.

Efficiency depends on many different things, like the specific DBMS, the specific version (different versions have different improvements on the optimizer and the available syntax), the type of columns, the indexes available, the size of the tables and distribution of values, the hardware the server is running, the configuration settings etc.

You should always test various different ways of writing the queries, on your tables, with the sizes and distribution you expect to have on production, with your hardware and configuration settings, to decide which rewritings of the queries should be kept.

This specific kind of query is often called greatest-n-per-group (there is even a tag for it!) and under certain assumptions, one of the many ways to write them, is often quite efficient in both MySQL and PostgreSQL. It uses a LATERAL join in Postgres, which is available in 9.3+ versions (in SQL Server lingo CROSS/OUTER APPLY) and a simulation of this join in MySQL.

The assumptions are that the number of authors (the attribute we group by on) is small, compared to the number of posts (the table where we apply the group by). It's also best if there is an index or a table to find all the distinct author_id values and an additional index on the posts table for the group by.

This solution to the greatest-n-per-group problem matches also your request about ties, as it returns always one result per group. If you want to be precise about which one (of the tied) will be returned, the ORDER BY in the subquery can be modified (to ORDER BY pi.date DESC, pi.id DESC or ORDER BY pi.date DESC, a.name for example).

Query in PostgreSQL:

SELECT p.* 
FROM authors AS a
   , LATERAL 
       ( SELECT pi.*
         FROM posts AS pi
         WHERE pi.author_id = a.author_id
         ORDER BY pi.date DESC
         LIMIT 1
       ) AS p ;

Query in MySQL:

SELECT p.* 
FROM authors AS a
  JOIN posts AS p
    ON p.id =
       ( SELECT pi.id
         FROM posts AS pi
         WHERE pi.author_id = a.author_id
         ORDER BY pi.date DESC
         LIMIT 1
       ) ;

The useful index is on posts (author_id, date, id) for MySQL and or on posts (author_id, date DESC) for Postgres.

Needless to say again but before using any of the above, they should be tested in your environment and cross tested against all the many other versions/rewritings of the query. In Postgres for example, the DISTINCT ON syntax can be used in version older than 9.3. The resulting query is more compact than the LATERAL and might be more efficient, under different data distributions. Query:

SELECT DISTINCT ON (author_id) p.*
FROM posts AS p
ORDER BY p.author_id,
         p.date DESC ;