My DBA experience doesn't go much further than simple storage + retrieval of CMS style data – so this may be a silly question, I don't know!
I have a problem whereby I need to lookup or calculate holiday prices for a certain group size and a certain number of days within a certain time period. E.g.:
How much is a hotel room for 2 people for 4 nights anytime in January?
I have pricing and availability data for, say, 5000 hotels stored like so:
Hotel ID | Date | Spaces | Price PP
-----------------------------------
123 | Jan1 | 5 | 100
123 | Jan2 | 7 | 100
123 | Jan3 | 5 | 100
123 | Jan4 | 3 | 100
123 | Jan5 | 5 | 100
123 | Jan6 | 7 | 110
456 | Jan1 | 5 | 120
456 | Jan2 | 1 | 120
456 | Jan3 | 4 | 130
456 | Jan4 | 3 | 110
456 | Jan5 | 5 | 100
456 | Jan6 | 7 | 90
With this table, I can do a query like so:
SELECT hotel_id, sum(price_pp)
FROM hotel_data
WHERE
date >= Jan1 and date <= Jan4
and spaces >= 2
GROUP BY hotel_id
HAVING count(*) = 4;
results
hotel_id | sum
----------------
123 | 400
The HAVING
clause here makes sure that there is an entry for every single day between my desired dates that has the spaces available. ie. Hotel 456 had 1 space available on Jan2, the HAVING clause would return 3, so we don't get a result for hotel 456.
So far so good.
However, is there a way to find out all the 4 night periods in January where there is space available? We could repeat the query 27 times – incrementing the dates each time, which does seem a little bit awkward. Or another way around could be to store all possible combinations in a lookup table like so:
Hotel ID | total price pp | num_people | num_nights | start_date
----------------------------------------------------------------
123 | 400 | 2 | 4 | Jan1
123 | 400 | 2 | 4 | Jan2
123 | 400 | 2 | 4 | Jan3
123 | 400 | 3 | 4 | Jan1
123 | 400 | 3 | 4 | Jan2
123 | 400 | 3 | 4 | Jan3
And so on. We'd have to limit max number of nights, and the max number of people we would search for – e.g. max nights = 28, max people = 10 (limited to the number of spaces available for that set period starting on that date).
For one hotel, this could give us 28*10*365=102000 outcomes per year. 5000 hotels = 500m outcomes!
But we'd have a very simple query to find the cheapest 4 night stay in Jan for 2 people:
SELECT
hotel_id, start_date, price
from hotel_lookup
where num_people=2
and num_nights=4
and start_date >= Jan1
and start_date <= Jan27
order by price
limit 1;
Is there a way to perform this query on the initial table without having to generate the 500m row lookup table!? e.g. generate the 27 possible outcomes in a temporary table or some other such inner query magic?
At the moment all data is held in a Postgres DB – if needs be for this purpose we can move the data out to something else more suitable? Not sure if this type of query fits the map/reduce patterns for NoSQL style DBs …
Best Answer
You can do much with window functions. Presenting two solutions: one with and one without materialized view.
Test case
Building on this table:
Days per
hotel_id
must be unique (enforced by PK here), or the rest is invalid.Multicolumn index for base table:
Note the reversed order as compared to the PK. You will probably need both indexes, for the following query, the 2nd index is essential. Detailed explanation:
Direct query without
MATERIALIZED VIEW
Also see @ypercube's variant with
lag()
, which can replaceday_ct
andday_diff
with a single check.How?
In the subquery, only consider days within your time frame ("in January" means, the last day is included in the time frame).
The frame for the window functions spans the current row plus the next
num_nights - 1
(4 - 1 = 3
) rows (days). Calculate the difference in days , the count of rows and the minimum of spaces to make sure the range is long enough, gapless and always has enough spaces.ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING`
cannot be parameterized for a prepared statement.I carefully drafted all window functions in the subquery to reuse the same window, using a single sort step.
The resulting price
sum_price
is already multiplied by the number of spaces requested.With
MATERIALIZED VIEW
To avoid inspecting many rows without chance of success, save only the columns you need plus three redundant, calculated values from the base table. Be sure the MV is up to date. If you are not familiar with the concept, read the manual first.
range_start
stores the first day of each continuous range for two purposes:range_len
is the number of days in the gapless range.max_spaces
is the maximum of open spaces in the range.I cast both to
smallint
( max. 32768 should be plenty for both) to optimize storage: only 52 bytes per row (incl. heap tuple header and item identifier). Details:Multicolumn index for MV:
Query based on MV
This is faster than the query on the table because more rows can be eliminated immediately. Again, the index is essential. Since partitions are gapless here, checking
day_ct
is enough.SQL Fiddle demonstrating both.
Repeated use
If you use it a lot, I would create an SQL function and only pass parameters. Or a PL/pgSQL function with dynamic SQL and
EXECUTE
to allow adapting the frame clause.Alternative
Range types with
date_range
to store continuous ranges in a single row might be an alternative - complicated in your case with potential variations on prices or spaces per day.Related: