Postgresql – Get rows with most recent date for each different item

greatest-n-per-grouppostgresql

Let's say this is the sample date coming from a join of 2 tables. Database is Postgres 9.6

id  product_id  invoice_id  amount       date
1    PROD1       INV01       2          01-01-2018
2    PROD2       INV02       3          01-01-2018
3    PROD1       INV01       2          05-01-2018
4    PROD1       INV03       1          05-01-2018
5    PROD2       INV02       3          08-01-2018
6    PROD2       INV04       4          08-01-2018

I want to know if it's possible in a optimized way to:

  1. Get all the PRODx with their respective INVx which have the latest date, but per product_id. Please note that records unused from a day may be reported to a new one.
    This means:
id  product_id  invoice_id  amount       date
3    PROD1       INV01       2          05-01-2018
4    PROD1       INV03       1          05-01-2018
5    PROD2       INV02       3          08-01-2018
6    PROD2       INV04       4          08-01-2018
  1. Get daily summed amounts for each PRODx but fill the gaps with the previous ones if day does not exist.

This means:

 product_id    amount       date
   PROD1         2          01-01-2018
   PROD2         3          01-01-2018
   PROD1         2          02-01-2018
   PROD2         3          02-01-2018
   PROD1         2          03-01-2018
   PROD2         3          03-01-2018
   PROD1         2          04-01-2018
   PROD2         3          04-01-2018
   PROD1         3          05-01-2018
   PROD2         3          05-01-2018
   PROD1         3          06-01-2018
   PROD2         3          06-01-2018
   PROD1         3          07-01-2018
   PROD2         3          07-01-2018
   PROD1         3          08-01-2018
   PROD2         7          08-01-2018

A few thoughts:

  1. For first question I could obtain the max(date) for each PRODx and the pick for each PRODx the rows that have the date=with max(date) but I was wondering if there's faster way to obtain this given a large number of recors in the database

  2. For the second question, I could generate a series of dates for the interval needed and then use WITH rows As and do the query grouping by product_id and sum by amount and then select for each date the previous values from rows with a limit 1 but that does not sound that optimized either.

Looking forward for any input. Thank you.

Later edit:
Trying to give DISTINCT ON () a try.

  • If I have distinct on(product_id, invoice_id) then I don't get only the most recent ones for the most recent date. If there were invoice_ids in the the past, beside the latest date, then they will be returned
  • If I have distinct on (product_id) then it returns from the most recent date, but as normal, only the last rows even if in the last day I have two positions for PROD1.

Basically I need something like 'I need for the most recent date, all the product_ids and their invoice_ids while keeping in mind that a product_id can have multiple invoice_ids'

Later edit 2:

Running a query like for first question seems to be reasonably fast:

select product_id, invoice_id, amount
from mytable inner join myOtherTable on...
             inner join (select max(date) as last_date, product_id 
                         from mytable 
                         group by product_id) sub on mytable.date = 
                         sub.last_date 

Best Answer

Skinning Q#1 independently and slightly differently than @ypercube

with cte as (select row_number() over (partition by product_id,
                                       invoice_id 
                                 order by dt desc) as rn,
                    product_id,
                    invoice_id,
                    amount,dt
               from product ) 
select product_id, invoice_id,amount,dt
  from cte
 where rn=1
 order by product_id,invoice_id;

 product_id | invoice_id | amount |     dt     
------------+------------+--------+------------
 PROD1      | INV01      |      2 | 2018-01-05
 PROD1      | INV03      |      1 | 2018-01-05
 PROD2      | INV02      |      3 | 2018-01-08
 PROD2      | INV04      |      4 | 2018-01-08
(4 rows)

For Q#2, you are on the right track, but the SQL will have a cross join(gasp!)

I think a function with a loop/cursor would be more optimized (i'll try that in my next free block of time)

--the cte will give us the real values
with cte as (select product_id, 
                    sum(amount) as amount, 
                    dt
               from product
              group by product_id,dt)
select p.product_id,  
       (select cte.amount --choose the amount
          from cte
         where cte.product_id = p.product_id
           and cte.dt <= d.gdt -- for same day or earlier
         order by cte.dt desc
         limit 1) as finamt,
       d.gdt
from (select generate_series( (select min(dt)
                                 from product), --where clause if some products 
                                                --don't have an amount
                              (select max(dt)
                                 from product),
                              '1 day' 
                            )::date as gdt)  d
cross join --assuming each listed product has an amount on the min date
     (select distinct product_id
        from product) p
left join --since we need to fill the gaps
     cte on ( d.gdt = cte.dt 
             and p.product_id = cte.product_id)
order by d.gdt, p.product_id
;