Postgresql – Items from last month AND if not possible, items from this month

greatest-n-per-groupjoin;postgresqlview

So the title is not really clear, but I could not think how to put it into shorter words better.

My problem is:

I have a database, which I use for web-scraping, with different articles and their prices. The prices may change each month. To be able to have a history, I created two tables:

  • items which has all articles that are currently in this month's crawling. The table has all the infos on every article
  • log which has only an article-id, price and date

So basically in items every article is unique, while log can have multiple occurrences of an article.

What I want to do now: I have created a view which yields every article from items and among other things calculates the price difference to last month (as the price and date in items is updated each month).

Here's an excerpt from my view:

CREATE VIEW XLSX AS 
    SELECT DISTINCT ON(items.article)
           items.article,
           items.price as "price_new",
           ROUND(CAST(items.price - log.price as numeric), 2) as "price_difference",
           log.price as "price_old"
    FROM   items
    INNER JOIN log 
    ON     items.article = log.article
    WHERE  (items.date - log.date) < 25
    ORDER BY items.article, log.date 

The problem I now have is that new articles are currently not being included in the view. Say for example that at the start of July I have 20 new articles, which are inserted into items and log with date = '2017-07-01'. Obviously they are not included in the view, due to the constraint to only include articles from log that are older than 25 days (since the crawler may run periodically multiple times a month, updating the prices more than once).

Could I use CASE here to get articles that are older than 25 days or if there are none, get those that have the same date?

Best Answer

Use LEFT JOIN LATERAL ... ON TRUE to preserve all rows of the right table:

SELECT i.article
     , i.price AS price_new
     , ROUND((i.price - l.price)::numeric, 2) AS price_difference
     , l.price AS price_old
FROM   items i
LEFT   JOIN LATERAL (
   SELECT price, date
   FROM   log
   WHERE  article = i.article
   AND    date < (i.date - 25)  -- make expression sargable, plus fix off-by-1 error
   ORDER  BY date DESC          -- to make sure ...
   LIMIT  1                     -- ... we pick at most 1 row
   ) l ON TRUE
ORDER  BY i.article, l.date;

Returns all rows in items, and you get NULL for price_difference where no previous older than 25 days is found in log.

I added ORDER BY ... LIMIT 1 to make sure we only pick the one latest entry. If data in table log can be trusted to never return more than one row, you can skip the added lines.

Or, if the query does not actually depend on items.date, a single subquery depending on the current date using DISTINCT ON can be cheaper (with the right indexes):

SELECT i.article
     , i.price AS price_new
     , ROUND((i.price - l.price)::numeric, 2) AS price_difference
     , l.price AS price_old
FROM   items i
LEFT   JOIN (
   SELECT DISTINCT ON (article)
          article, date, price
   FROM   log
   WHERE  date < (CURRENT_DATE - 25)
   ORDER  BY article, date DESC
   ) l USING (article)
ORDER  BY i.article, l.date;

About DISTINCT ON: