PostgreSQL – Equal-sized time series for each id

postgresql time-series-database

I'm trying to generate equal-sized time series for each customer. On some days a customer has some activity, and hence a retention value; on other days there is no activity, so the value should be 0.

CustRetention table is the following:

cust_id, date, intensity
...
1,2018-02-02,4
1,2018-02-06,5
...
2,2018-05-05,7
3,2018-02-06,4

What is the SQL to generate equal-length time series (say, 240 days) for each cust_id?
I tried the following, but it didn't work:

 drop table if exists retention_weekly_intensity_balanced;

 create table retention_weekly_intensity_balanced as 
 select  
 coalesce(t.cust_id, 'unidentified') as cust, 
 d.date_actual, 
 coalesce(t.size_of_increased_intensity) as size_of_increased_intensity
 from  retention_weekly_intensity t
 left join d_date d
ON
 d.date_actual = t.plaindate
WHERE d.date_actual>='2018-01-01'
AND d.date_actual <= '2018-09-01'
 ;

d_date is a simple calendar table

Best Answer

generate_series() returns a set of timestamps from the first argument to the second, stepping by the interval given as the third argument. That seems to be what you're looking for:

testdb=# create table customers(customer_id bigint);
CREATE TABLE
testdb=# insert into customers select 1;
INSERT 0 1
testdb=# insert into customers select 2;
INSERT 0 1
testdb=# select customer_id, d from customers, generate_series('2018-11-01'::timestamp, '2018-12-01', '1 week') AS d order by 1, 2;
 customer_id |          d          
-------------+---------------------
           1 | 2018-11-01 00:00:00
           1 | 2018-11-08 00:00:00
           1 | 2018-11-15 00:00:00
           1 | 2018-11-22 00:00:00
           1 | 2018-11-29 00:00:00
           2 | 2018-11-01 00:00:00
           2 | 2018-11-08 00:00:00
           2 | 2018-11-15 00:00:00
           2 | 2018-11-22 00:00:00
           2 | 2018-11-29 00:00:00
(10 rows)
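
Applied to the question, the same idea could look like the sketch below. It assumes the table is named CustRetention with the columns shown in the question (cust_id, date, intensity), that there is at most one row per customer and day, and that the 240-day window starts on 2018-01-01; the start date is only an illustration. Every customer is cross joined to the generated days, and the real data is left joined so that missing days get 0.

```sql
-- Sketch: one row per customer and day for 240 days, 0 where no activity.
-- Assumed: table CustRetention(cust_id, date, intensity); start date is illustrative.
SELECT c.cust_id
     , d.day::date              AS day
     , COALESCE(r.intensity, 0) AS intensity
FROM  (SELECT DISTINCT cust_id FROM CustRetention) AS c
CROSS  JOIN generate_series(timestamp '2018-01-01'
                          , timestamp '2018-01-01' + interval '239 days'
                          , interval '1 day') AS d(day)
LEFT   JOIN CustRetention AS r
       ON  r.cust_id = c.cust_id
       AND r.date    = d.day::date
ORDER  BY c.cust_id, d.day;
```

The important difference from the attempt in the question is the direction of the join: the full grid of customers and days is on the left side, so days without a matching row survive the join and COALESCE can fill them with 0.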

Test case

Building on this table:

CREATE TABLE hotel_data (
   hotel_id int
 , day      date  -- using "day", not "date"
 , spaces   int
 , price    int
 , PRIMARY KEY (hotel_id, day)  -- provides essential index automatically
);

Days per hotel_id must be unique (enforced by PK here), or the rest is invalid.

Multicolumn index for base table:

CREATE INDEX hotel_data_mult_idx ON hotel_data (day, hotel_id);

Note the reversed column order compared to the PK. You will probably need both indexes; for the following query, the second index is essential.

Direct query without `MATERIALIZED VIEW`

SELECT hotel_id, day, sum_price
FROM  (
   SELECT hotel_id, day, price, spaces
        , sum(price)      OVER w * 2   AS sum_price
        , min(spaces)     OVER w       AS min_spaces
        , last_value(day) OVER w - day AS day_diff
        , count(*)        OVER w       AS day_ct
   FROM   hotel_data
   WHERE  day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
   AND    spaces >= 2
   WINDOW w AS (PARTITION BY hotel_id ORDER BY day
                ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to nights - 1
   ) sub
WHERE  day_ct = 4
AND    day_diff = 3  -- make sure there is no gap
AND    min_spaces >= 2
ORDER  BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;

Also see @ypercube's variant with lag(), which can replace day_ct and day_diff with a single check.

How?

In the subquery, only consider days within your time frame ("in January" means the last day is included in the time frame).
The frame for the window functions spans the current row plus the next num_nights - 1 (4 - 1 = 3) rows (days). Calculate the difference in days, the count of rows, and the minimum of spaces to make sure the range is long enough, gapless, and always has enough spaces.
- Unfortunately, the frame clause of window functions does not accept dynamic values, so `ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING` cannot be parameterized for a prepared statement.
I carefully drafted all window functions in the subquery to reuse the same window, using a single sort step.
The resulting price sum_price is already multiplied by the number of spaces requested.

With `MATERIALIZED VIEW`

To avoid inspecting many rows without chance of success, save only the columns you need plus three redundant, calculated values from the base table. Be sure the MV is up to date. If you are not familiar with the concept, read the manual first.

CREATE MATERIALIZED VIEW mv_hotel AS
SELECT hotel_id, day
     , first_value(day) OVER (w ORDER BY day) AS range_start
     , price, spaces
     ,(count(*)    OVER w)::int2 AS range_len
     ,(max(spaces) OVER w)::int2 AS max_spaces

FROM  (
   SELECT *
        , day - row_number() OVER (PARTITION BY hotel_id ORDER BY day)::int AS grp
   FROM   hotel_data
   ) sub1
WINDOW w AS (PARTITION BY hotel_id, grp);

range_start stores the first day of each continuous range for two purposes:
- to mark a set of rows as members of a common range
- to show the start of the range for possible other purposes.
range_len is the number of days in the gapless range.
max_spaces is the maximum of open spaces in the range.
- Both columns are used to exclude impossible rows from the query immediately.
I cast both to smallint (max. 32767 should be plenty for both) to optimize storage: only 52 bytes per row (incl. heap tuple header and item identifier). Details:
- Configuring PostgreSQL for read performance
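
To see why grp identifies gapless ranges, it may help to look at the intermediate values. The query below is only an illustration against the hotel_data table defined above; the column alias rn is added for display. For consecutive days, the row number grows by one per day, so day minus row number stays constant; after a gap the result jumps to a later date.

```sql
-- Illustration: consecutive days of one hotel share the same grp value.
SELECT hotel_id, day
     , row_number() OVER (PARTITION BY hotel_id ORDER BY day)            AS rn
     , day - row_number() OVER (PARTITION BY hotel_id ORDER BY day)::int AS grp
FROM   hotel_data
ORDER  BY hotel_id, day;
```

The value of grp itself is meaningless as a date; it only serves as a label that is shared by all rows of one continuous range per hotel.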

Multicolumn index for MV:

CREATE INDEX mv_hotel_mult_idx ON mv_hotel (range_len, max_spaces, day);
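
Keeping the MV up to date means refreshing it after the base table changes. A minimal sketch follows; the unique index name mv_hotel_uni_idx is an assumption for illustration. A plain refresh locks the view against reads while it runs; the CONCURRENTLY variant avoids that but requires a unique index on the materialized view.

```sql
-- Plain refresh: rebuilds the MV, blocking concurrent reads while it runs.
REFRESH MATERIALIZED VIEW mv_hotel;

-- Optional: allow refreshing without blocking readers.
-- (hotel_id, day) is unique in the MV because it is the PK of hotel_data.
CREATE UNIQUE INDEX mv_hotel_uni_idx ON mv_hotel (hotel_id, day);
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_hotel;
```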

Query based on MV

SELECT hotel_id, day, sum_price
FROM  (
   SELECT hotel_id, day, price, spaces
        , sum(price)      OVER w * 2   AS sum_price
        , min(spaces)     OVER w       AS min_spaces
        , count(*)        OVER w       AS day_ct
   FROM   mv_hotel
   WHERE  day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
   AND    range_len >= 4   -- exclude impossible rows
   AND    max_spaces >= 2  -- exclude impossible rows
   WINDOW w AS (PARTITION BY hotel_id, range_start ORDER BY day
                ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to $nights - 1
   ) sub
WHERE  day_ct = 4
AND    min_spaces >= 2
ORDER  BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;

This is faster than the query on the table because more rows can be eliminated immediately. Again, the index is essential. Since partitions are gapless here, checking day_ct is enough.

SQL Fiddle demonstrating both.

Repeated use

If you use it a lot, I would create an SQL function and only pass parameters. Or a PL/pgSQL function with dynamic SQL and EXECUTE to allow adapting the frame clause.
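
A rough sketch of the PL/pgSQL variant, based on the direct query on hotel_data, might look like this. The function name f_hotel_offers and its parameter names are hypothetical. The frame offset is concatenated into the query string with format(), while all other values are passed with USING.

```sql
-- Hypothetical helper; name and parameters are illustrative only.
CREATE OR REPLACE FUNCTION f_hotel_offers(_from date, _to date, _nights int, _spaces int)
  RETURNS TABLE (hotel_id int, day date, sum_price bigint)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   RETURN QUERY EXECUTE format(
   $q$
   SELECT sub.hotel_id, sub.day, sub.sum_price
   FROM  (
      SELECT h.hotel_id, h.day
           , sum(h.price)      OVER w * $3    AS sum_price
           , min(h.spaces)     OVER w         AS min_spaces
           , last_value(h.day) OVER w - h.day AS day_diff
           , count(*)          OVER w         AS day_ct
      FROM   hotel_data h
      WHERE  h.day BETWEEN $1 AND $2
      AND    h.spaces >= $3
      WINDOW w AS (PARTITION BY h.hotel_id ORDER BY h.day
                   ROWS BETWEEN CURRENT ROW AND %s FOLLOWING)  -- nights - 1
      ) sub
   WHERE  sub.day_ct = $4
   AND    sub.day_diff = $4 - 1  -- no gap in the range
   AND    sub.min_spaces >= $3
   ORDER  BY sub.sum_price, sub.hotel_id, sub.day
   $q$, _nights - 1)
   USING _from, _to, _spaces, _nights;
END
$func$;

-- Example call: 4 nights for 2 spaces in January 2014.
SELECT * FROM f_hotel_offers('2014-01-01', '2014-01-31', 4, 2);
```

Because _nights is an integer, formatting it with %s into the query string does not open the door to SQL injection.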

Alternative

Range types, with daterange storing each continuous range in a single row, might be an alternative, though this is complicated in your case by potential variations in prices or spaces per day.

PostgreSQL Design – Similar Tables and Foreign Keys

Your questions, with my responses:

Does event_config need a reference to event via event_id? It makes sense to know which event the config belongs to, but at the same time it feels redundant since it's implicit from the relationship with dates.

Adding the event_id to the event_config table allows you to easily design a user-interface for modifying the event_config table, without needing a complex reference to an actual event instance. This might be useful for "default" configs.

This also allows an easier code path for presenting data to the user prior to the user choosing a date.

Isn't event_config an unneeded complexity? Its sole purpose is to provide shared fields to event_date and event_recurrence. Wouldn't it be better just to have the price and capacity columns both in date and recurrence?

event_config looks like a good optimization to me. I see it as a way to manage configurations independently of specific events in time. For instance, you might have a "holiday" config, a config for each weekday, and a config for "saturday" and "sunday", each of which might be exactly the same for each and every event that occurs on those days. Why duplicate that data by storing the info in event_date and event_recurrence?
