Postgresql – Items from last month AND if not possible, items from this month

greatest-n-per-group; join; postgresql; view

The title is not very clear, but I could not think of a shorter way to put it.

My problem is:

I have a database, which I use for web-scraping, with different articles and their prices. The prices may change each month. To be able to have a history, I created two tables:

items, which holds every article in the current month's crawl, along with all the information on each article
log, which holds only an article ID, a price and a date

So basically in items every article is unique, while log can have multiple occurrences of an article.
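For readers who want to try the queries on this page, the two tables might look roughly like the sketch below. The column types are assumptions: the question only names the columns article, price and date, and the cast to numeric in the view suggests (but does not confirm) a floating-point price.

```sql
-- Hypothetical schema; only the column names come from the question.
CREATE TABLE items (
    article text PRIMARY KEY,            -- one row per article
    price   double precision NOT NULL,   -- assumed type, current price
    date    date NOT NULL                -- date of the latest crawl
    -- ... the real table has further descriptive columns
);

CREATE TABLE log (
    article text NOT NULL,               -- may repeat: one row per crawl
    price   double precision NOT NULL,
    date    date NOT NULL
);
```

No foreign key is declared from log to items here, since items only holds articles from the current crawl while log keeps the history.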

What I want to do now: I have created a view which yields every article from items and, among other things, calculates the price difference to last month (as the price and date in items are updated each month).

Here's an excerpt from my view:

CREATE VIEW XLSX AS 
    SELECT DISTINCT ON(items.article)
           items.article,
           items.price as "price_new",
           ROUND(CAST(items.price - log.price as numeric), 2) as "price_difference",
           log.price as "price_old"
    FROM   items
    INNER JOIN log 
    ON     items.article = log.article
    WHERE  (items.date - log.date) < 25
    ORDER BY items.article, log.date

The problem I now have is that new articles are currently not being included in the view. Say for example that at the start of July I have 20 new articles, which are inserted into items and log with date = '2017-07-01'. Obviously they are not included in the view, due to the constraint to only include articles from log that are older than 25 days (since the crawler may run periodically multiple times a month, updating the prices more than once).

Could I use CASE here to get articles that are older than 25 days or if there are none, get those that have the same date?

Best Answer

Use LEFT JOIN LATERAL ... ON TRUE to preserve all rows of the left table (items):

SELECT i.article
     , i.price AS price_new
     , ROUND((i.price - l.price)::numeric, 2) AS price_difference
     , l.price AS price_old
FROM   items i
LEFT   JOIN LATERAL (
   SELECT price, date
   FROM   log
   WHERE  article = i.article
   AND    date < (i.date - 25)  -- make expression sargable, plus fix off-by-1 error
   ORDER  BY date DESC          -- to make sure ...
   LIMIT  1                     -- ... we pick at most 1 row
   ) l ON TRUE
ORDER  BY i.article, l.date;

This returns all rows in items, and you get NULL for price_old and price_difference where no log entry older than 25 days is found.
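If a NULL is not wanted for new articles, one possible variation is to fall back to the current price with COALESCE. This is only a sketch: it assumes that a new article's same-date log row carries the same price as items, so the fallback is equivalent to "take the row with the same date". The view name xlsx_with_fallback is made up for illustration.

```sql
-- Hypothetical view name; the fallback via COALESCE is an assumption,
-- not part of the original answer.
CREATE VIEW xlsx_with_fallback AS
SELECT i.article
     , i.price AS price_new
     , ROUND((i.price - COALESCE(l.price, i.price))::numeric, 2) AS price_difference
     , COALESCE(l.price, i.price) AS price_old   -- new articles: old = new
FROM   items i
LEFT   JOIN LATERAL (
   SELECT price
   FROM   log
   WHERE  article = i.article
   AND    date < (i.date - 25)
   ORDER  BY date DESC
   LIMIT  1
   ) l ON TRUE;
```

With this variation, new articles show a price_difference of 0 instead of NULL. Whether that is preferable depends on whether the consumer of the view needs to tell new articles apart from unchanged ones.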

About LATERAL:

What is the difference between LATERAL and a subquery in PostgreSQL?

I added ORDER BY ... LIMIT 1 to make sure we only pick the one latest entry. If the data in table log can be trusted to never contain more than one qualifying row per article, you can skip the added lines.

Or, if the query does not actually depend on items.date, a single subquery depending on the current date using DISTINCT ON can be cheaper (with the right indexes):

SELECT i.article
     , i.price AS price_new
     , ROUND((i.price - l.price)::numeric, 2) AS price_difference
     , l.price AS price_old
FROM   items i
LEFT   JOIN (
   SELECT DISTINCT ON (article)
          article, date, price
   FROM   log
   WHERE  date < (CURRENT_DATE - 25)
   ORDER  BY article, date DESC
   ) l USING (article)
ORDER  BY i.article, l.date;

About DISTINCT ON:

Select first row in each GROUP BY group?
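As for "the right indexes", a multicolumn index on log along these lines would be a reasonable starting point for both queries. The index name is made up, and whether the planner actually uses it depends on data distribution, so check with EXPLAIN.

```sql
-- Hypothetical index name. The column order matches
-- "WHERE article = ... ORDER BY date DESC LIMIT 1" in the LATERAL variant
-- and "ORDER BY article, date DESC" in the DISTINCT ON variant.
CREATE INDEX log_article_date_idx ON log (article, date DESC);
```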

Related Solutions

Postgresql – Storing data in PostgreSQL: One table or two

If you mix the history data in with the current like that in order to speed up queries over a time period, you do so at the expense of slowing down queries for current data. You can add an extra column to explicitly mark the relevant rows as the current prices (and have id+flag as the PK) but that adds extra work to your business logic to both keep it maintained and respect it in all reports.

Of course, if you move the price completely out of the main table you have a similar problem: finding the latest price becomes more expensive unless you have a "latest" flag, or denormalise slightly and keep a copy of the current price in the main table as well as in the price history table. Personally I would do the latter, and use a trigger on the product table to automatically update the price history table when a new product is added or a price is updated. I'm assuming the product table does not see massive write activity most of the time, so the performance impact of a trigger should be minimal. This removes the auditing task from your other logic and avoids bugs caused by new code that forgets to update the history/audit.
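A minimal sketch of such a trigger in PostgreSQL could look like the following. All table, column, function and trigger names are hypothetical, and EXECUTE FUNCTION requires PostgreSQL 11 or later (older versions spell it EXECUTE PROCEDURE).

```sql
-- Hypothetical tables: current price lives in product, history in price_history.
CREATE TABLE product (
    id    integer PRIMARY KEY,
    name  text NOT NULL,
    price numeric(10,2) NOT NULL
);

CREATE TABLE price_history (
    product_id integer NOT NULL REFERENCES product (id),
    price      numeric(10,2) NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

-- Copies the new price of the affected row into the history table.
CREATE FUNCTION log_price() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    INSERT INTO price_history (product_id, price)
    VALUES (NEW.id, NEW.price);
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END
$$;

-- Every new product gets an initial history row.
CREATE TRIGGER product_price_insert
AFTER INSERT ON product
FOR EACH ROW EXECUTE FUNCTION log_price();

-- Updates are logged only when the price actually changed.
CREATE TRIGGER product_price_update
AFTER UPDATE OF price ON product
FOR EACH ROW
WHEN (NEW.price IS DISTINCT FROM OLD.price)
EXECUTE FUNCTION log_price();
```

Splitting the logic into two triggers keeps the function free of any reference to OLD, which is not meaningful for inserts.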

Caveat: I'm answering this from a general PoV as I've not used postgres much in anger, so do some benchmarks before taking anything I've said regarding performance as fact in that system!

MySQL Aggregation – Nested Aggregation and Grouping on Multiple Columns

This is a "greatest-n-per-group" query and it's quite complicated to write in MySQL - first due to lack of window functions and second because you have 2 greatest-n-per-group specifications, first for the latest date per vendor and second for the lowest price per subcategory.

Here's one, rather complex way to write this:

SELECT 
    ps.subcategory_id,  ps.vendor_id,  ps.price   -- , ps.inserted_at
FROM
    ( SELECT DISTINCT subcategory_id
      FROM product
      WHERE category_id = 1
    ) AS s
  JOIN
    product AS ps
      ON  ps.category_id = 1
      AND ps.subcategory_id = s.subcategory_id
      AND ps.id =
          ( SELECT psv.id
            FROM
                ( SELECT DISTINCT subcategory_id, vendor_id
                  FROM product
                  WHERE category_id = 1
                ) AS sv
              JOIN
                product AS psv
                ON  psv.category_id = 1
                AND psv.subcategory_id = sv.subcategory_id
                AND psv.vendor_id = sv.vendor_id
                AND psv.inserted_at =
                    ( SELECT pi.inserted_at
                      FROM product AS pi
                      WHERE pi.category_id = 1
                        AND pi.subcategory_id = sv.subcategory_id
                        AND pi.vendor_id = sv.vendor_id
                      ORDER BY pi.inserted_at DESC
                      LIMIT 1
                    )
            WHERE sv.subcategory_id = s.subcategory_id
            ORDER BY psv.price,
                     psv.vendor_id
            LIMIT 1
          ) ;

Tested in SQLfiddle-2. Not too bad of a plan either, with a proper index on (category_id, subcategory_id, vendor_id, inserted_at).

It may not be the most efficient, and I would definitely experiment with indexes (see the Fiddle, where I have one more index; it may not be very useful, but test it on bigger tables).

(The first version of the query is in SQLfiddle-1.)
