So the title is not really clear, but I could not think how to put it into shorter words better.
My problem is:
I have a database, which I use for web-scraping, with different articles and their prices. The prices may change each month. To be able to have a history, I created two tables:
items
which has all articles that are currently in this month's crawling. The table has all the infos on every articlelog
which has only anarticle-id
,price
anddate
So basically in items
every article is unique, while log
can have multiple occurrences of an article.
What I want to do now: I have created a view which yields every article from items
and among other things calculates the price difference to last month (as the price and date in items
is updated each month).
Here's an excerpt from my view:
CREATE VIEW XLSX AS
SELECT DISTINCT ON(items.article)
items.article,
items.price as "price_new",
ROUND(CAST(items.price - log.price as numeric), 2) as "price_difference",
log.price as "price_old"
FROM items
INNER JOIN log
ON items.article = log.article
WHERE (items.date - log.date) < 25
ORDER BY items.article, log.date
The problem I now have is that new articles are currently not being included in the view. Say for example that at the start of July I have 20 new articles, which are inserted into items
and log
with date = '2017-07-01'
. Obviously they are not included in the view, due to the constraint to only include articles from log
that are older than 25 days (since the crawler may run periodically multiple times a month, updating the prices more than once).
Could I use CASE
here to get articles that are older than 25 days or if there are none, get those that have the same date?
Best Answer
Use
LEFT JOIN LATERAL ... ON TRUE
to preserve all rows of the right table:Returns all rows in
items
, and you get NULL forprice_difference
where no previous older than 25 days is found inlog
.I added
ORDER BY ... LIMIT 1
to make sure we only pick the one latest entry. If data in tablelog
can be trusted to never return more than one row, you can skip the added lines.Or, if the query does not actually depend on
items.date
, a single subquery depending on the current date usingDISTINCT ON
can be cheaper (with the right indexes):About
DISTINCT ON
: