How to join by the most recent preceding date

teradata

Imagine I have a database with historic price data and a second table with dates

enter image description here

A new row was entered into the price history table every time the price changed.

How can I (efficiently) join the two tables so that I can get the price each customer had to pay on that day?

I need the syntax for teradata, thanks!

Best Answer

I'm afraid that I cannot guarantee 100% Teradata compatibility for my answer as there are, to my knowledge, no Teradata fiddles available. However, I have used PostgreSQL, one of the more standards compliant systems and there are no PostgreSQL specific constructs used in my answer (see the fiddle here), so I think we're good to go!

I glanced at the Teradata documentation for Window functions (LAG and LEAD) and for Common Table Expressions (CTEs) and comparing it with PostgreSQL's syntax (Window functions, CTEs), it all seems fairly vanilla.

First thing I did was CREATE and populate the tables. As an aside, you should avoid using images on StackExchange for the reasons outlined in this link. Use text for DDL and DML or, better yet, provide a fiddle - if you can't find a Teradata one (I looked and couldn't), use something like PostgreSQL which should get you most of the way there.

Another small note, it's not a good idea to name your fields using SQL keywords. Liberal use of the _ (underscore) character is a big help here.

CREATE TABLE price
(
  item_id TEXT NOT NULL,
  item_price SMALLINT NOT NULL,
  change_date DATE NOT NULL
);

INSERT INTO price 
VALUES 
('A', 249, '2014-09-01'),
('B', 129, '2014-09-01'),
('A', 299, '2014-08-25'),
('B', 139, '2014-09-25');

and

CREATE TABLE purchase
(
  customer_id SMALLINT NOT NULL,
  item_id TEXT NOT NULL,
  purchase_date DATE NOT NULL
);

INSERT INTO purchase
VALUES
(1, 'A', '2014-08-27'), (2, 'A', '2014-09-02');

I then ran the following queries - I've left in extraneous fields which aren't required for the final calculation so that my thought processes could be followed - using LAG and LEAD can be tricky and it's easy to mix up which field you should be comparing with which other one, so my policy is to leave them all in and let the OP (Original Poster - i.e. you) decide which you find useful and want to keep and which you wish to eliminate.

WITH pr_cte AS
(
  SELECT 
    p.item_id, 
    LAG(p.item_price, 1) 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lag_price,
    p.item_price,
    LEAD(p.item_price, 1) 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lead_price,
    LAG(change_date, 1, '2000-01-01') 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lag_date,
    p.change_date,
    LEAD(p.change_date, 1, '2038-01-01') 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lead_date
  FROM price p
  -- ORDER BY p.item_id, p.change_date -- not allowed in Teradata according to @dnoeth
)
SELECT * FROM pr_cte;

If you don't have a version with CTEs, you can always use a subquery.

Result:

item_id     lag_price   item_price  lead_price       lag_date  change_date  lead_date
      A                        299          249    2000-01-01   2014-08-25  2014-09-01
      A           299          249                 2014-08-25   2014-09-01  2038-01-01
      B                        129          139    2000-01-01   2014-09-01  2014-09-25
      B           129          139                 2014-09-01   2014-09-25  2038-01-01

Notice the use of the defaults 2000-01-01 and 2038-01-01 - I'm implicitly assuming that the prices before the specified dates in the price table go back to 2000 and that prices after will continue till 2038 when *nix time will explode! Obviously, you can change these for values more suited to your use case. I've left it up to you to provide default prices, again as per your use case.

Then, I ran:

WITH pr_cte AS
(
  SELECT 
    p.item_id, 
    LAG(p.item_price, 1) 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lag_price,
    p.item_price,
    LEAD(p.item_price, 1) 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lead_price,
    LAG(change_date, 1, '2000-01-01') 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lag_date,
    p.change_date,
    LEAD(p.change_date, 1, '2038-01-01') 
      OVER (PARTITION BY p.item_id ORDER BY p.change_date) AS lead_date
  FROM price p
  -- ORDER BY p.item_id, p.change_date -- see above and comment by @dnoeth
)
SELECT 
  pr.item_id, pr.item_price,
  pur.customer_id, pur.item_id, pur.purchase_date
FROM pr_cte pr
JOIN purchase pur 
  ON pr.item_id = pur.item_id
  AND pur.purchase_date >= pr.change_date
  AND pur.purchase_date < pr.lead_date;

Result (again, cull fields as you see fit):

item_id     item_price  customer_id     item_id     purchase_date
      A            299            1           A        2014-08-27
      A            249            2           A        2014-09-02

Which is, AFAICT, the correct answer. You may want to test with edge cases which I may have overlooked - if there's a problem, get back to me. +1 for an interesting and challenging question and welcome to the forum!