SQL – Handling Sparse Data in Time Series with PostgreSQL

postgresql

[I wasn't sure if this is the proper place to post SQL questions.] I have a public dataset of pharmaceutical prices. Any given drug gets a new price on some unpredictable day, and then the price remains that price until the next price change.

E.g.

drug            date          new_price
acetaminophen   2020-01-09    0.25
oxycontin       2020-01-10    1.40
valaxirin       2020-02-10    2.34
oranicin        2020-02-11    1.54
acetaminophen   2020-02-12    1.47

I have to do a variety of analytics e.g. "what was the price of acetaminophen on 2020-02-01?" Well that would of course be 0.25, but I need a way to figure that out in SQL. I have a variety of more complex queries, e.g. "list the ten cheapest drugs on a given date". So a solution I think needs to be generalized.

I realize that one possible solution would be to run a job that populates the database with prices for every day of the year, but I prefer not to solve the problem that way.

Best Answer

You will need to use a correlated subquery to get the price as of a given date:

SELECT
  price.drug
 ,price.date
 ,price.new_price
FROM
  DrugPrice price
WHERE
  price.Drug = 'acetaminophen'
    AND price.new_price =
      (
        SELECT
          MAX(date)
        FROM
          DrugPrice
        WHERE
          Drug = price.Drug
            AND date <= '2020-02-01'
      )

You will need a unique B-tree index on (Drug,Date) for that query to return consistent results in a reasonable amount of time.

For the ten cheapest type of query you can do something along the lines of:

SELECT
  drug
 ,date
 ,new_price
 ,price_order
FROM
  (
    SELECT
      price.drug
     ,price.date
     ,price.new_price
     ,ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY price.new_price, price.drug) AS price_order
    FROM
      DrugPrice price
    WHERE
      price.Drug = 'acetaminophen'
        AND price.new_price =
          (
            SELECT
              MAX(date)
            FROM
              DrugPrice
            WHERE
              Drug = price.Drug
                AND date <= '2020-02-01'
          )
  ) x
WHERE
  price_order <= 10

This is going to require scanning the whole table, then ordering the rows, which can be expensive. If you can keep the underlying table small then this might not take too long and there may be no need to go further. But if the requirement is to return that data right away, you can create a table like this and run an update to check for changed records each day:

CREATE TABLE PriceSummary
(
  RankTypeCd   CHAR(1)      NOT NULL  --Ascending/descending/other metrics probably make this a FK to a reference table
 ,DrugRank     TINYINT      NOT NULL  --Don't need to rank everything, just maybe top 100
 ,AsOfDt       DATE         NOT NULL
 ,Drug         <whatever>   NOT NULL
 ,CONSTRAINT FK_PriceSummary_Summarizes_Drug FOREIGN KEY (Drug)  REFERENCES <Drug Table> (Drug)
 ,CONSTRAINT PK_PriceSummary PRIMARY KEY (RankTypeCd, DrunkRank, AsOfDt)
)

Then if you wanted to get the least expensive drugs (let's call that RankTypeCd = 'A' for ascending) you could:

SELECT
  PriceSummary.Drug
 ,PriceSummary.DrugRank
FROM
  PriceSummary PriceSummary
WHERE
  PriceSummary.RankTypeCd = 'A'
    AND PriceSummary.DrugRank <= 10
    AND PriceSummary.AsOfDt =
      (
        SELECT
          MAX(AsOfDt)
        FROM
          PriceSummary
        WHERE
          RankTypeCd = PriceSummary.RankTypeCd
            AND DrugRank = PriceSummary.DrugRank
            AND AsOfDt <= '2020-02-01'
      )