The task as I understand it
Pick rows from a table where the period overlaps with a given time frame. Determine distinct ranges of overlapping periods within that set and return the greatest sum(quantity) from any range.
Requires Postgres 9.2+, since there are no range types in older versions.
Assumptions
"Overlapping" is meant in a cascading manner, like tiles on a traditional roof: those are "overlapping" (rainproof), though the highest tile does not directly overlap with the lowest.
All values in period
have inclusive bounds ([]
). Else you have to adjust for exclusive bounds. (The range of the input parameter can still have arbitrary bounds.)
We filter for exactly one event_type_id
. Else you have to add PARTITION BY event_type_id
to the window definition.
quantity
is an integer
. Else you have to adjust for the type in calculations.
Quantities for overlapping periods are counted fully, even if parts of the period are outside your given time frame.
Even works for duplicates on (event_type_id, period)
.
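For illustration, a sketch of how the query below could be adjusted to process multiple event types in one pass. This variant is my own assumption, not part of the original answer; note that the outer lag() window needs the same PARTITION BY as the inner one, so running sums of different event types are never mixed:

SELECT event_type_id
     , running_sum - lag(running_sum, 1, 0)
         OVER (PARTITION BY event_type_id ORDER BY p_start) AS sum_quantity
FROM  (
   SELECT event_type_id
        , lower(period) AS p_start
        , (sum(quantity) OVER w)::int AS running_sum
        , lead(lower(period), 1, 'infinity') OVER w
        > max(upper(period)) OVER w AS range_end
   FROM   event
   WHERE  period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
   WINDOW w AS (PARTITION BY event_type_id ORDER BY lower(period))  -- partitioned window
   ) sub
WHERE  range_end
ORDER  BY sum_quantity DESC
LIMIT  1;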
Best performance with a single subquery
This should be dynamite.
SELECT running_sum - lag(running_sum, 1, 0) OVER (ORDER BY p_start) AS sum_quantity
FROM (
SELECT lower(period) AS p_start
,(sum(quantity) OVER w)::int AS running_sum
, lead(lower(period), 1, 'infinity') OVER w
> max(upper(period)) OVER w AS range_end
FROM event
WHERE event_type_id = 1
AND period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
WINDOW w AS (ORDER BY lower(period))
) sub
WHERE range_end
ORDER BY 1 DESC
LIMIT 1;
All three window functions in the subquery can use the same window. This avoids additional sort operations and should be fastest.
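For reference, a minimal table definition the query can run against. The actual schema is not shown here, so names and types are inferred from the query and are assumptions:

CREATE TABLE event (
  event_id      serial PRIMARY KEY
, event_type_id int       NOT NULL
, period        tstzrange NOT NULL  -- with inclusive bounds, per the assumptions above
, quantity      int       NOT NULL
);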
Verbose CTE variant with more explanation
Same query, just more verbose and slower, since CTEs materialize derived tables and act as optimization barriers (up to Postgres 11; since Postgres 12, plain CTEs can be inlined).
WITH cte1 AS (
SELECT quantity
, lower(period) AS p_start
, upper(period) AS p_end
FROM event
WHERE event_type_id = 1
AND period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
)
, cte2 AS (
SELECT (sum(quantity) OVER w)::int AS running_sum
, lead(p_start, 1, 'infinity') OVER w -- next start ..
> max(p_end) OVER w AS range_end -- .. after last end
, p_start, p_end
FROM cte1
WINDOW w AS (ORDER BY p_start)
)
SELECT running_sum - lag(running_sum, 1, 0) OVER (ORDER BY p_start) AS sum_quantity
-- subtract the previous sum to get the sum of this range
, p_end::text
FROM cte2
WHERE range_end -- only rows at the end of each range
ORDER BY 1 DESC -- biggest sum first
LIMIT 1; -- only return the winner
sqlfiddle for Postgres 9.3
db<>fiddle here for Postgres 12
You need an index for this to be fast with big tables. The best option would be a GiST index on (event_type_id, period). Details:
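A sketch of such an index; the index name is my choice, and the additional module btree_gist is needed to include the plain integer column in a GiST index:

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- GiST operator classes for plain data types

CREATE INDEX event_type_period_gist_idx ON event
USING gist (event_type_id, period);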
Explanation
Filter rows that match your conditions, then sort by the start of the time range (lower(period)) and calculate:

1. A running sum of quantity (running_sum).
2. The start of the next period (lead(lower(period), 1, 'infinity')). Defaults to 'infinity' for the last row to include the last range.
3. The latest end of any period so far (max(upper(period))).

If 2. is later than 3., it's the end of a (sub-)range (range_end).

In the outer SELECT, filter rows with range_end and subtract the previous total to get the sum of the range. ORDER BY that result and return the first (LIMIT 1), i.e. the greatest sum_quantity. Voilà.
Aside
To select all of Jan 7th, 2016, the clean expression is:
'[2016-01-07 00:00:00+00,2016-01-08 00:00:00+00)'::tstzrange
Not:
'[2016-01-07 00:00:00+00,2016-01-07 23:59:59+00]'::tstzrange
Details:
Since the default precision of timestamp values is 6 decimal digits (microseconds resolution), you could also use:
'[2016-01-07 00:00:00+00,2016-01-07 23:59:59.999999+00]'::tstzrange
But that's messy and depends on an implementation detail that might change (even if unlikely). It's not subject to rounding errors, though, since timestamps are stored as integer values (counting microseconds) in modern Postgres.
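A quick, self-contained check of both variants (my own demonstration, not from the original):

SELECT '[2016-01-07 00:00:00+00,2016-01-08 00:00:00+00)'::tstzrange
        @> '2016-01-07 23:59:59.999999+00'::timestamptz AS clean_covers_it   -- true
     , '[2016-01-07 00:00:00+00,2016-01-07 23:59:59+00]'::tstzrange
        @> '2016-01-07 23:59:59.999999+00'::timestamptz AS messy_covers_it;  -- false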
Your answer basically gets the job done:
SELECT b.id, array_agg(b.stock) AS stock
FROM (
SELECT i.id, COALESCE(i_s.stock, 0) AS stock
FROM item i
CROSS JOIN unnest('{1,2}'::int[]) n
LEFT JOIN item_stock i_s ON i.id = i_s.item_id AND n.n = i_s.shop_id
ORDER BY i.id, n.n
) b
GROUP BY b.id;
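For context, a minimal schema these queries can run against. The original table definitions are not shown, so this is an assumption (the PK matches the advice in the Index section below):

CREATE TABLE item (
  id serial PRIMARY KEY
);

CREATE TABLE item_stock (
  item_id int NOT NULL REFERENCES item
, shop_id int NOT NULL
, stock   int NOT NULL
, PRIMARY KEY (shop_id, item_id)
);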
Two notable changes:

- Order is not guaranteed without ORDER BY in the subquery or as additional clause to array_agg() (typically more expensive; see the example after this list). And that's the core element of your question.
- unnest('{1,2}'::int[]) instead of generate_series(1,2), since the requested shop IDs will hardly be sequential all the time.

I also moved the set-returning function from the SELECT list to a separate table expression attached with CROSS JOIN. Standard SQL form, but that's just a matter of clarity and taste, not a necessity - at least in Postgres 10 or later. See:
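For illustration, the more expensive alternative from the first point above, ordering inside the aggregate instead of the subquery:

SELECT i.id, array_agg(COALESCE(i_s.stock, 0) ORDER BY n.n) AS stock
FROM   item i
CROSS  JOIN unnest('{1,2}'::int[]) n
LEFT   JOIN item_stock i_s ON i.id = i_s.item_id AND n.n = i_s.shop_id
GROUP  BY i.id;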
Doing the same with CROSS JOIN LATERAL and an ARRAY constructor might be a bit faster, as we don't need the outer GROUP BY, and the ARRAY constructor is typically faster, too:
SELECT i.id, s.stock
FROM item i
CROSS JOIN LATERAL (
SELECT ARRAY(
SELECT COALESCE(i_s.stock, 0)
FROM unnest('{1,2}'::int[]) n
LEFT JOIN item_stock i_s ON i_s.shop_id = n.n
AND i_s.item_id = i.id
ORDER BY n.n
) AS stock
) s;
Related:
And if you have more than just the two shops, a nested crosstab() (from the additional module tablefunc) should provide top performance:
SELECT i.id, COALESCE(stock, '{0,0}') AS stock
FROM item i
LEFT JOIN (
SELECT id, ARRAY[COALESCE(shop1, 0), COALESCE(shop2, 0)] AS stock
FROM crosstab(
$$SELECT item_id, shop_id, stock
FROM item_stock
WHERE shop_id = ANY ('{1,2}'::int[])
ORDER BY 1,2$$
, $$SELECT unnest('{1,2}'::int[])$$
) AS ct (id int, shop1 int, shop2 int)
) i_s USING (id);
Needs to be adapted in more places to cater for different shop IDs.
Related:
db<>fiddle here
Index
Make sure you have at least an index on item_stock (shop_id, item_id) - typically provided by a PRIMARY KEY on those columns. For the crosstab query, it also matters that shop_id comes first. See:
Adding stock as another index column may allow faster index-only scans. In Postgres 11 or later, consider adding it to the PK with an INCLUDE clause like so:
PRIMARY KEY (shop_id, item_id) INCLUDE (stock)
But only if you need it a lot, as it makes the index a bit bigger and possibly more susceptible to bloat from updates.
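Spelled out as a complete statement, as a sketch; it assumes the constraint carries the default name item_stock_pkey:

ALTER TABLE item_stock
  DROP CONSTRAINT item_stock_pkey                       -- drop the old PK ...
, ADD  PRIMARY KEY (shop_id, item_id) INCLUDE (stock);  -- ... and recreate it with INCLUDE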
Best Answer
This works as desired:
About LATERAL:

For big tables, it would be much more efficient to normalize your design with a separate ranges table (one range per row) instead of the ranges array. You can use a GiST index for that:

Solution for huge table
For a huge table like you mention in the comments (1 billion rows), I would consider a separate ranges table, optimized for size, and a BRIN index to go along with it. Assuming the numbers can be represented as bigint without loss, which is considerably cheaper to store. See below.

The BRIN operator class for range types shipped with Postgres 9.5 is range_inclusion_ops, which supports the overlaps operator &&.

To optimize disk space some more, I would just save two bigint numbers (your numeric values multiplied by 1000000) and make it a functional BRIN index. Basically like this:
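The following is a sketch; table and column names are my assumptions, as the original code is not included here:

CREATE TABLE ranges (
  id          int    NOT NULL
, range_start bigint NOT NULL  -- numeric lower bound * 1000000
, range_end   bigint NOT NULL  -- numeric upper bound * 1000000
);

-- functional BRIN index over the reconstructed range:
CREATE INDEX ranges_brin_idx ON ranges
USING brin (int8range(range_start, range_end, '[]'));

Query to get the same result as before, again as a sketch; 1.5 and 2.5 stand in for the actual numeric bounds:

SELECT id
FROM   ranges
WHERE  int8range(range_start, range_end, '[]')
    && int8range((1.5 * 1000000)::bigint, (2.5 * 1000000)::bigint, '[]');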
Storage size of bigint vs. numrange

A numrange over numbers with a precision of 15 occupies 32 bytes on disk, resulting in 64 bytes per row (plus int column, tuple header and item identifier). The same with two bigint columns (2 x 8 bytes) results in 52 bytes total. That makes the table around 12 GB smaller. Index size is the same. See for yourself:
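One quick way to compare is pg_column_size() on ad-hoc values (a standalone check of my own, not from the original):

SELECT pg_column_size(numrange(123456789.123456, 987654321.654321)) AS numrange_bytes
     , pg_column_size(123456789123456::bigint) * 2                  AS two_bigint_bytes;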
Detailed explanation for row size: