PostgreSQL – Store Millions of Rows of Denormalized Data or Use SQL Magic?

datetime, denormalization, postgresql, window functions

My DBA experience doesn't go much further than simple storage + retrieval of CMS style data – so this may be a silly question, I don't know!

I have a problem whereby I need to look up or calculate holiday prices for a certain group size and a certain number of days within a certain time period. E.g.:

How much is a hotel room for 2 people for 4 nights anytime in January?

I have pricing and availability data for, say, 5000 hotels stored like so:

Hotel ID | Date | Spaces | Price PP
-----------------------------------
     123 | Jan1 | 5      | 100
     123 | Jan2 | 7      | 100
     123 | Jan3 | 5      | 100
     123 | Jan4 | 3      | 100
     123 | Jan5 | 5      | 100
     123 | Jan6 | 7      | 110
     456 | Jan1 | 5      | 120
     456 | Jan2 | 1      | 120
     456 | Jan3 | 4      | 130
     456 | Jan4 | 3      | 110
     456 | Jan5 | 5      | 100
     456 | Jan6 | 7      |  90

With this table, I can do a query like so:

SELECT hotel_id, sum(price_pp)
FROM hotel_data
WHERE
    date >= Jan1 and date <= Jan4
    and spaces >= 2
GROUP BY hotel_id
HAVING count(*) = 4;

results

hotel_id | sum
----------------
     123 | 400

The HAVING clause here makes sure that there is an entry with enough spaces for every single day between my desired dates. I.e., hotel 456 had only 1 space available on Jan2, so count(*) would return 3 and we don't get a result for hotel 456.

So far so good.

However, is there a way to find out all the 4 night periods in January where there is space available? We could repeat the query 27 times – incrementing the dates each time, which does seem a little bit awkward. Or another way around could be to store all possible combinations in a lookup table like so:

Hotel ID | total price pp | num_people | num_nights | start_date
----------------------------------------------------------------
     123 |            400 | 2          | 4          | Jan1
     123 |            400 | 2          | 4          | Jan2
     123 |            400 | 2          | 4          | Jan3
     123 |            400 | 3          | 4          | Jan1
     123 |            400 | 3          | 4          | Jan2
     123 |            400 | 3          | 4          | Jan3

And so on. We'd have to limit max number of nights, and the max number of people we would search for – e.g. max nights = 28, max people = 10 (limited to the number of spaces available for that set period starting on that date).

For one hotel, this could give us 28*10*365 = 102,200 outcomes per year. 5000 hotels = roughly 500m outcomes!
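
The size estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic; the constant names are made up for illustration and the limits are the ones assumed above.

```python
# Upper bound on lookup-table rows, using the limits assumed in the question.
MAX_NIGHTS = 28             # longest stay we would search for
MAX_PEOPLE = 10             # largest group we would search for
START_DATES_PER_YEAR = 365  # one possible start date per day
HOTELS = 5000

rows_per_hotel = MAX_NIGHTS * MAX_PEOPLE * START_DATES_PER_YEAR
total_rows = rows_per_hotel * HOTELS

print(rows_per_hotel)  # 102200
print(total_rows)      # 511000000
```

This is an upper bound: start dates without enough spaces for the whole stay would not produce a row.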

But we'd have a very simple query to find the cheapest 4 night stay in Jan for 2 people:

SELECT
hotel_id, start_date, price
from hotel_lookup
where num_people=2
and num_nights=4
and start_date >= Jan1
and start_date <= Jan27
order by price
limit 1;

Is there a way to perform this query on the initial table without having to generate the 500m row lookup table!? e.g. generate the 27 possible outcomes in a temporary table or some other such inner query magic?

At the moment all data is held in a Postgres DB – if needs be for this purpose we can move the data out to something else more suitable? Not sure if this type of query fits the map/reduce patterns for NoSQL style DBs …

Best Answer

You can do much with window functions. Presenting two solutions: one with and one without materialized view.

Test case

Building on this table:

CREATE TABLE hotel_data (
   hotel_id int
 , day      date  -- using "day", not "date"
 , spaces   int
 , price    int
 , PRIMARY KEY (hotel_id, day)  -- provides essential index automatically
);

Days per hotel_id must be unique (enforced by PK here), or the rest is invalid.

Multicolumn index for base table:

CREATE INDEX hotel_data_mult_idx ON hotel_data (day, hotel_id);

Note the reversed column order compared to the PK. You will probably need both indexes; for the following query, the second index is essential.

Direct query without `MATERIALIZED VIEW`

SELECT hotel_id, day, sum_price
FROM  (
   SELECT hotel_id, day, price, spaces
        , sum(price)      OVER w * 2   AS sum_price
        , min(spaces)     OVER w       AS min_spaces
        , last_value(day) OVER w - day AS day_diff
        , count(*)        OVER w       AS day_ct
   FROM   hotel_data
   WHERE  day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
   AND    spaces >= 2
   WINDOW w AS (PARTITION BY hotel_id ORDER BY day
                ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to nights - 1
   ) sub
WHERE  day_ct = 4
AND    day_diff = 3  -- make sure there is no gap
AND    min_spaces >= 2
ORDER  BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;

Also see @ypercube's variant with lag(), which can replace day_ct and day_diff with a single check.

How?

- In the subquery, only consider days within your time frame ("in January" means the last day is included in the time frame).
- The frame for the window functions spans the current row plus the next num_nights - 1 (4 - 1 = 3) rows (days). Calculate the difference in days, the count of rows and the minimum of spaces to make sure the range is long enough, gapless and always has enough spaces.
  - Unfortunately, the frame clause of window functions does not accept dynamic values, so ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING cannot be parameterized for a prepared statement.
- I carefully drafted all window functions in the subquery to reuse the same window, using a single sort step.
- The resulting price sum_price is already multiplied by the number of spaces requested.
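
To make the frame logic concrete, here is a minimal Python sketch that mirrors the query step by step on the sample rows from the question. It is an illustration, not a replacement for the SQL: the function name find_stays is hypothetical, and the year 2014 is borrowed from the query above.

```python
from datetime import date
from itertools import groupby

# Sample rows from the question: (hotel_id, day, spaces, price per person).
# The year 2014 is an assumption taken from the queries in this answer.
ROWS = [
    (123, date(2014, 1, 1), 5, 100), (123, date(2014, 1, 2), 7, 100),
    (123, date(2014, 1, 3), 5, 100), (123, date(2014, 1, 4), 3, 100),
    (123, date(2014, 1, 5), 5, 100), (123, date(2014, 1, 6), 7, 110),
    (456, date(2014, 1, 1), 5, 120), (456, date(2014, 1, 2), 1, 120),
    (456, date(2014, 1, 3), 4, 130), (456, date(2014, 1, 4), 3, 110),
    (456, date(2014, 1, 5), 5, 100), (456, date(2014, 1, 6), 7, 90),
]


def find_stays(rows, people, nights, first_day, last_day):
    """Return (sum_price, hotel_id, start_day) for every gapless stay."""
    # WHERE clause: keep days inside the time frame with enough spaces.
    usable = sorted(
        r for r in rows if first_day <= r[1] <= last_day and r[2] >= people
    )
    stays = []
    # PARTITION BY hotel_id ORDER BY day
    for hotel_id, hotel_rows in groupby(usable, key=lambda r: r[0]):
        days = list(hotel_rows)
        # Only start rows with a full frame (day_ct = nights) are visited.
        for i in range(len(days) - nights + 1):
            frame = days[i:i + nights]  # CURRENT ROW AND nights - 1 FOLLOWING
            day_diff = (frame[-1][1] - frame[0][1]).days
            if day_diff == nights - 1:  # no gap inside the frame
                sum_price = sum(r[3] for r in frame) * people
                stays.append((sum_price, hotel_id, frame[0][1]))
    return sorted(stays)  # ORDER BY sum_price, hotel_id, day


result = find_stays(ROWS, people=2, nights=4,
                    first_day=date(2014, 1, 1), last_day=date(2014, 1, 31))
for sum_price, hotel_id, day in result:
    print(hotel_id, day, sum_price)
```

For the sample rows this finds three stays for hotel 123 (starting Jan 1, 2 and 3) and one for hotel 456 (starting Jan 3). The frame starting Jan 1 for hotel 456 is rejected by the gap check, because the Jan 2 row was filtered out for lack of spaces.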

With `MATERIALIZED VIEW`

To avoid inspecting many rows without chance of success, save only the columns you need plus three redundant, calculated values from the base table. Be sure the MV is up to date. If you are not familiar with the concept, read the manual first.

CREATE MATERIALIZED VIEW mv_hotel AS
SELECT hotel_id, day
     , first_value(day) OVER (w ORDER BY day) AS range_start
     , price, spaces
     ,(count(*)    OVER w)::int2 AS range_len
     ,(max(spaces) OVER w)::int2 AS max_spaces

FROM  (
   SELECT *
        , day - row_number() OVER (PARTITION BY hotel_id ORDER BY day)::int AS grp
   FROM   hotel_data
   ) sub1
WINDOW w AS (PARTITION BY hotel_id, grp);

- range_start stores the first day of each continuous range for two purposes:
  - to mark a set of rows as members of a common range
  - to show the start of the range for possible other purposes.
- range_len is the number of days in the gapless range.
- max_spaces is the maximum of open spaces in the range.
  - Both columns are used to exclude impossible rows from the query immediately.

I cast both to smallint (max. 32767 should be plenty for both) to optimize storage: only 52 bytes per row (incl. heap tuple header and item identifier). Details:
- Configuring PostgreSQL for read performance
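
The grouping trick in the materialized view (day minus row_number() is constant within a gapless run of days) can be sketched in Python as follows. The sample rows and the function name add_ranges are hypothetical, chosen only to show one hotel with a missing day.

```python
from datetime import date, timedelta
from itertools import groupby

# Hypothetical sample: (hotel_id, day, spaces); there is no row for Jan 3.
ROWS = [
    (1, date(2014, 1, 1), 5),
    (1, date(2014, 1, 2), 2),
    (1, date(2014, 1, 4), 7),
    (1, date(2014, 1, 5), 3),
    (1, date(2014, 1, 6), 4),
]


def add_ranges(rows):
    """Return (hotel_id, day, range_start, range_len, max_spaces) per row."""
    out = []
    for hotel_id, hotel_rows in groupby(sorted(rows), key=lambda r: r[0]):
        # row_number() OVER (PARTITION BY hotel_id ORDER BY day)
        numbered = enumerate(hotel_rows, start=1)
        # day - row_number stays constant as long as days are consecutive.
        for _, island in groupby(
            numbered, key=lambda nr: nr[1][1] - timedelta(days=nr[0])
        ):
            members = [row for _, row in island]
            range_start = members[0][1]               # first_value(day)
            range_len = len(members)                  # count(*)
            max_spaces = max(r[2] for r in members)   # max(spaces)
            out.extend(
                (h, d, range_start, range_len, max_spaces)
                for h, d, _ in members
            )
    return out


for row in add_ranges(ROWS):
    print(row)
```

Here Jan 1 and Jan 2 form one range of length 2, and Jan 4 to Jan 6 form a second range of length 3, so a query for 4 nights could skip all five rows based on range_len alone.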

Multicolumn index for MV:

CREATE INDEX mv_hotel_mult_idx ON mv_hotel (range_len, max_spaces, day);

Query based on MV

SELECT hotel_id, day, sum_price
FROM  (
   SELECT hotel_id, day, price, spaces
        , sum(price)      OVER w * 2   AS sum_price
        , min(spaces)     OVER w       AS min_spaces
        , count(*)        OVER w       AS day_ct
   FROM   mv_hotel
   WHERE  day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
   AND    range_len >= 4   -- exclude impossible rows
   AND    max_spaces >= 2  -- exclude impossible rows
   WINDOW w AS (PARTITION BY hotel_id, range_start ORDER BY day
                ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to $nights - 1
   ) sub
WHERE  day_ct = 4
AND    min_spaces >= 2
ORDER  BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;

This is faster than the query on the table because more rows can be eliminated immediately. Again, the index is essential. Since partitions are gapless here, checking day_ct is enough.

SQL Fiddle demonstrating both.

Repeated use

If you use it a lot, I would create an SQL function and only pass parameters. Or a PL/pgSQL function with dynamic SQL and EXECUTE to allow adapting the frame clause.
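
As a client-side variation on the dynamic SQL idea (not the PL/pgSQL function itself), the frame clause could also be filled in by the application before the statement is sent. The sketch below assumes a driver with psycopg-style named placeholders such as %(people)s; the function name build_stay_query and the upper limit of 28 nights (taken from the question) are illustrative. Only the validated integer is spliced into the SQL text; everything else stays a bound parameter.

```python
def build_stay_query(nights: int) -> str:
    """Build the direct query with the frame size filled in.

    `nights` is validated as a small integer before it is interpolated,
    because the frame clause cannot take a bound parameter.
    """
    if isinstance(nights, bool) or not isinstance(nights, int):
        raise ValueError("nights must be an integer")
    if not 1 <= nights <= 28:  # 28 is the maximum assumed in the question
        raise ValueError("nights must be between 1 and 28")
    following = nights - 1
    # %(people)s, %(first_day)s and %(last_day)s are left for the driver.
    return f"""
SELECT hotel_id, day, sum_price
FROM  (
   SELECT hotel_id, day
        , sum(price)      OVER w * %(people)s AS sum_price
        , last_value(day) OVER w - day        AS day_diff
        , count(*)        OVER w              AS day_ct
   FROM   hotel_data
   WHERE  day BETWEEN %(first_day)s AND %(last_day)s
   AND    spaces >= %(people)s
   WINDOW w AS (PARTITION BY hotel_id ORDER BY day
                ROWS BETWEEN CURRENT ROW AND {following} FOLLOWING)
   ) sub
WHERE  day_ct = {nights}
AND    day_diff = {following}
ORDER  BY sum_price, hotel_id, day
"""


sql = build_stay_query(4)
print(sql)
```

The resulting text would then be executed with a parameter dictionary holding people, first_day and last_day.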

Alternative

Range types with daterange to store continuous ranges in a single row might be an alternative, though complicated in your case by potential variations in prices or spaces per day.

Related Solutions

Sql-server – Handling time zones in data mart/warehouse

Firstly...

Separating Datetime/Time into a Date dimension and a Time dimension is definitely the way to go.

To manage multiple time zones you need to duplicate the DateKey and the TimeKey so that you have the following:

LocalDateKey
LocalTimeKey
UtcDateKey
UtcTimeKey

You say...

The problem I am having with all that is that 11:00 PM on Tuesday, December 31, 2013 in UTC is Wednesday, January 1st, 2014 in all time zones that are after UTC+2.

By having the 4 columns I've listed above, you will be able to join the fact table to the Date and/or Time dimension using table aliases (in Kimball terminology these aliased dimension tables are known as "Role Playing Dimensions"), so you would have something like the following:

/*
    Assumes the following:
        - [DateLongName] has the format of this example "Tuesday, December 31, 2013"
        - [TimeShortName] has the format of this example "11:00 PM"
        - Both [DateLongName] & [TimeShortName] are strings
*/
select
    -- Returns a string matching this example  "11:00 PM Tuesday, December 31, 2013"
    localTime.TimeShortName + ' ' + localDate.DateLongName
    ,utcTime.TimeShortName + ' ' + utcDate.DateLongName
    ,f.*
from
    FactTableName  AS f

    -- Local Date and Local Time joins          
    inner join dbo.Date  AS localDate
        on localDate.DateKey = f.LocalDateKey

    inner join dbo.Time  AS localTime
        on localTime.TimeKey = f.LocalTimeKey 

    -- Utc Date and Utc Time joins    
    inner join dbo.Date  AS utcDate
        on utcDate.DateKey = f.UtcDateKey

    inner join dbo.Time  AS utcTime
        on utcTime.TimeKey = f.UtcTimeKey

In closing...

As you're building a data mart, and not an OLTP database, the generation of the Local and Utc times should be performed in your ETL, NOT in any client side applications for the following reasons (apart from localisation of the UTC time to the report reader's perspective):

- Having the calculation reside in any queries places an extra performance burden on them, multiplied by the number of times you have to run said query for any reports you have (this matters when reading millions of rows).
- There is the extra burden of ensuring the calculation is maintained correctly in each query (especially when you take daylight saving time into account).
- Calculating in the query prevents range scanning of any indexes the column is part of: performing a calculation on the column forces queries to use index scans instead of seeks (scans are typically more expensive, as each data page needs to be read); this is known as being non-sargable.
  - Edit due to comments: This applies if you push the conversion down into the actual query.
- Using the concept of having the additional UTC dates and times available, there is nothing stopping you from extending it by calling this StandardisedDateKey, or CorporateHQDateKey, where instead of a UTC date table you standardise based on some other business-agreed standard.
- Having the two separate column types (Local and UTC) allows for side-by-side comparison across geographic distance. Think: someone in Australia enters a record that is timestamped with both Local and UTC; someone in New York reads the report with the Local (Australia) date and time and the New York representation of the UTC date and time, thereby seeing that something their Australian counterpart did during the middle of the day (Australia time) happened in the middle of the night their time (New York time). This comparison of time is indispensable in multi-national businesses.
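
As a rough illustration of deriving all four keys in the ETL step, here is a small Python sketch. The key formats (YYYYMMDD for dates, HHMM for times) and the function names are assumptions for the example, not something the dimension tables above prescribe, and a fixed UTC offset is used for simplicity where a real ETL would resolve a named time zone, including daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def date_key(dt: datetime) -> int:
    """Hypothetical DateKey format: YYYYMMDD as an integer."""
    return dt.year * 10000 + dt.month * 100 + dt.day


def time_key(dt: datetime) -> int:
    """Hypothetical TimeKey format: HHMM as an integer."""
    return dt.hour * 100 + dt.minute


def fact_keys(event: datetime, local_offset_hours: int) -> dict:
    """Compute the Local and Utc keys for one fact row during ETL."""
    utc = event.astimezone(timezone.utc)
    # Fixed offset for illustration only; it ignores daylight saving time.
    local = utc.astimezone(timezone(timedelta(hours=local_offset_hours)))
    return {
        "LocalDateKey": date_key(local),
        "LocalTimeKey": time_key(local),
        "UtcDateKey": date_key(utc),
        "UtcTimeKey": time_key(utc),
    }


# 11:00 PM on Tuesday, December 31, 2013 in UTC, loaded for a UTC+2 source.
event = datetime(2013, 12, 31, 23, 0, tzinfo=timezone.utc)
keys = fact_keys(event, local_offset_hours=2)
print(keys)
```

With these assumptions the row gets a UTC date key of 20131231 and a local date key of 20140101, so both perspectives can be joined to the Date dimension without any calculation in the report query.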

How to find the price based on the given range

I don't get the same output but I think there is an error in your sample with the last row (=1500) and/or a price for QTY>30 is missing.

You can look at this SQL Fiddle and SQL Server query:

With list as (
    Select n = ROW_NUMBER() over(order by (select 1)) From (
        Select n = 1 From (values (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) as x1(n)
        Cross Join (values (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) as x2(n)
    ) as x
)
Select pr.PROD, pr.[date], QTY = count(l.n), Price = sum(pi.price) From list as l
Inner Join Price pi on l.n >= pi.srange and l.n <= pi.erange
Inner Join (
    Select p1.PROD, p1.date, previous = coalesce(SUM(p2.QTY), 0), p1.QTY
    From Product as p1
    Left Join Product  as p2 on p1.PROD = p2.PROD and p2.date < p1.date
    Group By p1.PROD, p1.date, p1.QTY
) as pr on pr.PROD = pi.PROD and l.n > pr.previous and l.n <= (pr.previous+pr.QTY)
Group By pr.PROD, pr.date

Output:

PROD | date         | QTY   | Price
A    | 2015-01-01   | 5     | 520 
A    | 2015-02-01   | 10    | 1100
A    | 2015-03-01   | 5     | 600
A    | 2015-04-01   | 10    | 1200

With 1 more row in Price such as ('A', 31, 50, 150), it gives the output you want:

PROD | date         | QTY   | Price
A    | 2015-01-01   | 5     | 520 
A    | 2015-02-01   | 10    | 1100
A    | 2015-03-01   | 5     | 600
A    | 2015-04-01   | 10    | 1500

SQL Fiddle with Price > 30.
