SQL – Handling Sparse Data in Time Series with PostgreSQL

postgresql

[I wasn't sure if this is the proper place to post SQL questions.] I have a public dataset of pharmaceutical prices. Any given drug gets a new price on some unpredictable day, and then the price remains that price until the next price change.

E.g.

drug            date          new_price
acetaminophen   2020-01-09    0.25
oxycontin       2020-01-10    1.40
valaxirin       2020-02-10    2.34
oranicin        2020-02-11    1.54
acetaminophen   2020-02-12    1.47

I have to do a variety of analytics e.g. "what was the price of acetaminophen on 2020-02-01?" Well that would of course be 0.25, but I need a way to figure that out in SQL. I have a variety of more complex queries, e.g. "list the ten cheapest drugs on a given date". So a solution I think needs to be generalized.

I realize that one possible solution would be to run a job that populates the database with prices for every day of the year, but I prefer not to solve the problem that way.

Best Answer

You will need to use a correlated subquery to get the price as of a given date:

SELECT
  price.drug
 ,price.date
 ,price.new_price
FROM
  DrugPrice price
WHERE
  price.Drug = 'acetaminophen'
    AND price.new_price =
      (
        SELECT
          MAX(date)
        FROM
          DrugPrice
        WHERE
          Drug = price.Drug
            AND date <= '2020-02-01'
      )

You will need a unique B-tree index on (Drug,Date) for that query to return consistent results in a reasonable amount of time.

For the ten cheapest type of query you can do something along the lines of:

SELECT
  drug
 ,date
 ,new_price
 ,price_order
FROM
  (
    SELECT
      price.drug
     ,price.date
     ,price.new_price
     ,ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY price.new_price, price.drug) AS price_order
    FROM
      DrugPrice price
    WHERE
      price.Drug = 'acetaminophen'
        AND price.new_price =
          (
            SELECT
              MAX(date)
            FROM
              DrugPrice
            WHERE
              Drug = price.Drug
                AND date <= '2020-02-01'
          )
  ) x
WHERE
  price_order <= 10

This is going to require scanning the whole table, then ordering the rows, which can be expensive. If you can keep the underlying table small then this might not take too long and there may be no need to go further. But if the requirement is to return that data right away, you can create a table like this and run an update to check for changed records each day:

CREATE TABLE PriceSummary
(
  RankTypeCd   CHAR(1)      NOT NULL  --Ascending/descending/other metrics probably make this a FK to a reference table
 ,DrugRank     TINYINT      NOT NULL  --Don't need to rank everything, just maybe top 100
 ,AsOfDt       DATE         NOT NULL
 ,Drug         <whatever>   NOT NULL
 ,CONSTRAINT FK_PriceSummary_Summarizes_Drug FOREIGN KEY (Drug)  REFERENCES <Drug Table> (Drug)
 ,CONSTRAINT PK_PriceSummary PRIMARY KEY (RankTypeCd, DrunkRank, AsOfDt)
)

Then if you wanted to get the least expensive drugs (let's call that RankTypeCd = 'A' for ascending) you could:

SELECT
  PriceSummary.Drug
 ,PriceSummary.DrugRank
FROM
  PriceSummary PriceSummary
WHERE
  PriceSummary.RankTypeCd = 'A'
    AND PriceSummary.DrugRank <= 10
    AND PriceSummary.AsOfDt =
      (
        SELECT
          MAX(AsOfDt)
        FROM
          PriceSummary
        WHERE
          RankTypeCd = PriceSummary.RankTypeCd
            AND DrugRank = PriceSummary.DrugRank
            AND AsOfDt <= '2020-02-01'
      )

Test case

Building on this table:

CREATE TABLE hotel_data (
   hotel_id int
 , day      date  -- using "day", not "date"
 , spaces   int
 , price    int
 , PRIMARY KEY (hotel_id, day)  -- provides essential index automatically
);

Days per hotel_id must be unique (enforced by PK here), or the rest is invalid.

Multicolumn index for base table:

CREATE INDEX mv_hotel_mult_idx ON mv_hotel (day, hotel_id);

Note the reversed order as compared to the PK. You will probably need both indexes, for the following query, the 2nd index is essential. Detailed explanation:

Direct query without `MATERIALIZED VIEW`

SELECT hotel_id, day, sum_price
FROM  (
   SELECT hotel_id, day, price, spaces
        , sum(price)      OVER w * 2   AS sum_price
        , min(spaces)     OVER w       AS min_spaces
        , last_value(day) OVER w - day AS day_diff
        , count(*)        OVER w       AS day_ct
   FROM   hotel_data
   WHERE  day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
   AND    spaces >= 2
   WINDOW w AS (PARTITION BY hotel_id ORDER BY day
                ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to nights - 1
   ) sub
WHERE  day_ct = 4
AND    day_diff = 3  -- make sure there is not gap
AND    min_spaces >= 2
ORDER  BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;

Also see @ypercube's variant with lag(), which can replace day_ct and day_diff with a single check.

How?

In the subquery, only consider days within your time frame ("in January" means, the last day is included in the time frame).
The frame for the window functions spans the current row plus the next num_nights - 1 (4 - 1 = 3) rows (days). Calculate the difference in days , the count of rows and the minimum of spaces to make sure the range is long enough, gapless and always has enough spaces.
- Unfortunately, the frame clause of window functions does not accept dynamic values, so ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING` cannot be parameterized for a prepared statement.
I carefully drafted all window functions in the subquery to reuse the same window, using a single sort step.
The resulting price sum_price is already multiplied by the number of spaces requested.

With `MATERIALIZED VIEW`

To avoid inspecting many rows without chance of success, save only the columns you need plus three redundant, calculated values from the base table. Be sure the MV is up to date. If you are not familiar with the concept, read the manual first.

CREATE MATERIALIZED VIEW mv_hotel AS
SELECT hotel_id, day
     , first_value(day) OVER (w ORDER BY day) AS range_start
     , price, spaces
     ,(count(*)    OVER w)::int2 AS range_len
     ,(max(spaces) OVER w)::int2 AS max_spaces

FROM  (
   SELECT *
        , day - row_number() OVER (PARTITION BY hotel_id ORDER BY day)::int AS grp
   FROM   hotel_data
   ) sub1
WINDOW w AS (PARTITION BY hotel_id, grp);

range_start stores the first day of each continuous range for two purposes:
- to mark a set of rows as members of a common range
- to show the start of the range for possible other purposes.
range_len is the number of days in the gapless range.
max_spaces is the maximum of open spaces in the range.
- Both columns are used to exclude impossible rows from the query immediately.
I cast both to smallint ( max. 32768 should be plenty for both) to optimize storage: only 52 bytes per row (incl. heap tuple header and item identifier). Details:
- Configuring PostgreSQL for read performance

Multicolumn index for MV:

CREATE INDEX mv_hotel_mult_idx ON mv_hotel (range_len, max_spaces, day);

Query based on MV

SELECT hotel_id, day, sum_price
FROM  (
   SELECT hotel_id, day, price, spaces
        , sum(price)      OVER w * 2   AS sum_price
        , min(spaces)     OVER w       AS min_spaces
        , count(*)        OVER w       AS day_ct
   FROM   mv_hotel
   WHERE  day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
   AND    range_len >= 4   -- exclude impossible rows
   AND    max_spaces >= 2  -- exclude impossible rows
   WINDOW w AS (PARTITION BY hotel_id, range_start ORDER BY day
                ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to $nights - 1
   ) sub
WHERE  day_ct = 4
AND    min_spaces >= 2
ORDER  BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;

This is faster than the query on the table because more rows can be eliminated immediately. Again, the index is essential. Since partitions are gapless here, checking day_ct is enough.

SQL Fiddle demonstrating both.

Repeated use

If you use it a lot, I would create an SQL function and only pass parameters. Or a PL/pgSQL function with dynamic SQL and EXECUTE to allow adapting the frame clause.

Alternative

Range types with date_range to store continuous ranges in a single row might be an alternative - complicated in your case with potential variations on prices or spaces per day.

PostgreSQL – How to Store Time Series Data

There's really no one 'best way' to store time series data, and it honestly depends on a number of factors. However, I'm going to focus on two factors primarily, with them being:

(1) How serious is this project that it deserves your effort to optimize the schema?

(2) What are your query access patterns really going to be like?

With those questions in mind, let's discuss a few schema options.

Flat Table

The option to use a flat table has a lot more to do with question (1), where if this isn't a serious or large-scale project, you'll find it much easier not to think too much about the schema, and just use a flat table, as:

CREATE flat_table(
  trip_id integer,
  tstamp timestamptz,
  speed float,
  distance float,
  temperature float,
  ,...);

There aren't many cases where I'd recommend this course, only if this is a tiny project which doesn't warrant much of your time.

Dimensions and Facts

So, if you've cleared the hurdle of question (1), and you want a more performance schema, this is one of the first options to consider. It includes some basic normailization, but extracting the 'dimensional' quantities from the measured 'fact' quantities.

Essentially, you'll want a table to record info about the trips,

CREATE trips(
  trip_id integer,
  other_info text);

and a table to record timestamps,

CREATE tstamps(
  tstamp_id integer,
  tstamp timestamptz);

and finally all of your measured facts, with foreign key references to the dimension tables (that is meas_facts(trip_id) references trips(trip_id) & meas_facts(tstamp_id) references tstamps(tstamp_id))

CREATE meas_facts(
  trip_id integer,
  tstamp_id integer,
  speed float,
  distance float,
  temperature float,
  ,...);

This may not seem like it is all that helpful at first, but if you have for example thousands of concurrent trips, then they may all be taking measurements once per second, on the second. In that case, you'd have to re-record the time stamp each time for each trip, rather than just using a single entry in the tstamps table.

Use case: This case will be good if there are many concurrent trips for which you are recording data, and you don't mind accessing all of the measurement types all together.

Since Postgres reads by rows, any time you wanted, for example, the speed measurements over a given time range, you must read the whole row from the meas_facts table, which will definitely slow down a query, though if the data set you are working with is not too large, then you wouldn't even notice the difference.

Splitting Up Your Measured Facts

To extend the last section just a bit further, you could break apart your measurements into separate tables, where for example I'll show the tables for speed and distance:

CREATE speed_facts(
  trip_id integer,
  tstamp_id integer,
  speed float);

and

CREATE distance_facts(
  trip_id integer,
  tstamp_id integer,
  distance float);

Of course, you can see how this might be extended to the other measurements.

Use case: So this won't give you a tremendous speed up for a query, perhaps only a linear increase in speed when you are querying about one measurement type. This is because when you want to look up info about speed, you need only to read rows from the speed_facts table, rather than all the extra, unneeded info that would be present in a row of the meas_facts table.

So, you need to read huge bulks of data about one measurement type only, you could get some benefit. With your proposed case of 10 hours of data at one second intervals, you'd only be reading 36,000 rows, so you'd never really find a significant benefit from doing this. However, if you were to be looking at speed measurement data for 5,000 trips that were all around 10 hours, now you're looking at reading 180 million rows. A linear increase in speed for such a query could yield some benefit, so long as you only need to access one or two of the measurement types at a time.

Arrays/HStore/ & TOAST

You probably don't need to worry about this part, but I know of cases where it does matter. If you need to access HUGE amounts of time series data, and you know you need to access all of it in one huge block, you can use a structure which will make use of the TOAST Tables, which essentially stores your data in larger, compressed segments. This leads to quicker access to the data, as long as your goal is to access all of the data.

One example implementation could be

CREATE uber_table(
  trip_id integer,
  tstart timestamptz,
  speed float[],
  distance float[],
  temperature float[],
  ,...);

In this table, tstart would store the time stamp for the first entry in the array, and each subsequent entry would be the value of a reading for the next second. This requires you to manage the relevant time stamp for each array value in a piece of application software.

Another possibility is

CREATE uber_table(
  trip_id integer,
  speed hstore,
  distance hstore,
  temperature hstore,
  ,...);

where you add your measurement values as (key,value) pairs of (timestamp, measurement).

Use case: This is an implementation probably better left to someone who is more comfortable with PostgreSQL, and only if you are sure about your access patterns needing to be bulk access patterns.

Conclusions?

Wow, this got much longer than I expected, sorry. :)

Essentially, there are a number of options, but you'll probably get the biggest bang for your buck by using the second or third, as they fit the more general case.

P.S.: Your initial question implied that you will be bulk loading your data after it has all been collected. If you are streaming the data in to your PostgreSQL instance, you will need to do some further work to handle both your data ingestion and query workload, but we'll leave that for another time. ;)