The task as I understand it
Pick rows from a table where the period overlaps with a given time frame. Determine distinct ranges of overlapping periods within that set and return the greatest sum(quantity) from any range.
Requires Postgres 9.2+, since there are no range types in older versions.
Assumptions
"Overlapping" is meant in a cascading manner, like tiles on a traditional roof: those are "overlapping" (rainproof), though the highest tile does not directly overlap with the lowest.
All values in period
have inclusive bounds ([]
). Else you have to adjust for exclusive bounds. (The range of the input parameter can still have arbitrary bounds.)
We filter for exactly one event_type_id
. Else you have to add PARTITION BY event_type_id
to the window definition.
quantity
is an integer
. Else you have to adjust for the type in calculations.
Quantities for overlapping periods are counted fully, even if parts of the period are outside your given time frame.
Even works for duplicates on (event_type_id, period)
.
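For illustration, a sketch of how the query below could be adjusted to process multiple event types in one pass. This variant is my own assumption, not part of the original answer; note that the outer lag() window needs the same PARTITION BY as the inner one, so running sums of different event types are never mixed:

SELECT event_type_id
     , running_sum - lag(running_sum, 1, 0)
         OVER (PARTITION BY event_type_id ORDER BY p_start) AS sum_quantity
FROM  (
   SELECT event_type_id
        , lower(period) AS p_start
        , (sum(quantity) OVER w)::int AS running_sum
        , lead(lower(period), 1, 'infinity') OVER w
        > max(upper(period)) OVER w AS range_end
   FROM   event
   WHERE  period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
   WINDOW w AS (PARTITION BY event_type_id ORDER BY lower(period))  -- partitioned window
   ) sub
WHERE  range_end
ORDER  BY sum_quantity DESC
LIMIT  1;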
Best performance with a single subquery
This should be dynamite.
SELECT running_sum - lag(running_sum, 1, 0) OVER (ORDER BY p_start) AS sum_quantity
FROM (
SELECT lower(period) AS p_start
,(sum(quantity) OVER w)::int AS running_sum
, lead(lower(period), 1, 'infinity') OVER w
> max(upper(period)) OVER w AS range_end
FROM event
WHERE event_type_id = 1
AND period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
WINDOW w AS (ORDER BY lower(period))
) sub
WHERE range_end
ORDER BY 1 DESC
LIMIT 1;
All three window functions in the subquery can use the same window. This avoids additional sort operations and should be fastest.
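For reference, a minimal table definition the query can run against. The actual schema is not shown here, so names and types are inferred from the query and are assumptions:

CREATE TABLE event (
  event_id      serial PRIMARY KEY
, event_type_id int       NOT NULL
, period        tstzrange NOT NULL  -- with inclusive bounds, per the assumptions above
, quantity      int       NOT NULL
);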
Verbose CTE variant with more explanation
Same query, just more verbose and slower, since CTEs materialize derived tables and act as optimization barriers (up to Postgres 11; since Postgres 12, plain CTEs can be inlined).
WITH cte1 AS (
SELECT quantity
, lower(period) AS p_start
, upper(period) AS p_end
FROM event
WHERE event_type_id = 1
AND period && '[2016-01-01 0:0+0,2016-01-10 0:0+0]'::tstzrange
)
, cte2 AS (
SELECT (sum(quantity) OVER w)::int AS running_sum
, lead(p_start, 1, 'infinity') OVER w -- next start ..
> max(p_end) OVER w AS range_end -- .. after last end
, p_start, p_end
FROM cte1
WINDOW w AS (ORDER BY p_start)
)
SELECT running_sum - lag(running_sum, 1, 0) OVER (ORDER BY p_start) AS sum_quantity
-- subtract the previous sum to get the sum of this range
, p_end::text
FROM cte2
WHERE range_end -- only rows at the end of each range
ORDER BY 1 DESC -- biggest sum first
LIMIT 1; -- only return the winner
sqlfiddle for Postgres 9.3
db<>fiddle here for Postgres 12
You need an index for this to be fast with big tables. The best option would be a GiST index on (event_type_id, period). Details:
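A sketch of such an index; the index name is my choice, and the additional module btree_gist is needed to include the plain integer column in a GiST index:

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- GiST operator classes for plain data types

CREATE INDEX event_type_period_gist_idx ON event
USING gist (event_type_id, period);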
Explanation
Filter rows that match your conditions, then sort by the start of the time range (lower(period)) and calculate:

1. A running sum of quantity (running_sum).
2. The start of the next period (lead(lower(period), 1, 'infinity')). Defaults to 'infinity' for the last row to include the last range.
3. The latest end of any period so far (max(upper(period))).

If 2. is later than 3., it's the end of a (sub-)range (range_end).

In the outer SELECT, filter rows with range_end and subtract the previous total to get the sum of the range. ORDER BY that result and return the first (LIMIT 1), i.e. the greatest sum_quantity. Voilà.
Aside
To select all of Jan 7th, 2016, the clean expression is:
'[2016-01-07 00:00:00+00,2016-01-08 00:00:00+00)'::tstzrange
Not:
'[2016-01-07 00:00:00+00,2016-01-07 23:59:59+00]'::tstzrange
Details:
Since the default precision of timestamp values is 6 decimal digits (microseconds resolution), you could also use:
'[2016-01-07 00:00:00+00,2016-01-07 23:59:59.999999+00]'::tstzrange
But that's messy and depends on an implementation detail that might change (even if unlikely). It's not subject to rounding errors, though, since timestamps are stored as integer values (counting microseconds) in modern Postgres.
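A quick, self-contained check of both variants (my own demonstration, not from the original):

SELECT '[2016-01-07 00:00:00+00,2016-01-08 00:00:00+00)'::tstzrange
        @> '2016-01-07 23:59:59.999999+00'::timestamptz AS clean_covers_it   -- true
     , '[2016-01-07 00:00:00+00,2016-01-07 23:59:59+00]'::tstzrange
        @> '2016-01-07 23:59:59.999999+00'::timestamptz AS messy_covers_it;  -- false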
Your answer basically gets the job done:
SELECT b.id, array_agg(b.stock) AS stock
FROM (
SELECT i.id, COALESCE(i_s.stock, 0) AS stock
FROM item i
CROSS JOIN unnest('{1,2}'::int[]) n
LEFT JOIN item_stock i_s ON i.id = i_s.item_id AND n.n = i_s.shop_id
ORDER BY i.id, n.n
) b
GROUP BY b.id;
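For context, a minimal schema these queries can run against. The original table definitions are not shown, so this is an assumption (the PK matches the advice in the Index section below):

CREATE TABLE item (
  id serial PRIMARY KEY
);

CREATE TABLE item_stock (
  item_id int NOT NULL REFERENCES item
, shop_id int NOT NULL
, stock   int NOT NULL
, PRIMARY KEY (shop_id, item_id)
);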
Two notable changes:

- Order is not guaranteed without ORDER BY in the subquery or as additional clause to array_agg() (typically more expensive; see the example after this list). And that's the core element of your question.
- unnest('{1,2}'::int[]) instead of generate_series(1,2), since the requested shop IDs will hardly be sequential all the time.

I also moved the set-returning function from the SELECT list to a separate table expression attached with CROSS JOIN. Standard SQL form, but that's just a matter of clarity and taste, not a necessity - at least in Postgres 10 or later. See:
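For illustration, the more expensive alternative from the first point above, ordering inside the aggregate instead of the subquery:

SELECT i.id, array_agg(COALESCE(i_s.stock, 0) ORDER BY n.n) AS stock
FROM   item i
CROSS  JOIN unnest('{1,2}'::int[]) n
LEFT   JOIN item_stock i_s ON i.id = i_s.item_id AND n.n = i_s.shop_id
GROUP  BY i.id;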
Doing the same with CROSS JOIN LATERAL and an ARRAY constructor might be a bit faster, as we don't need the outer GROUP BY, and the ARRAY constructor is typically faster, too:
SELECT i.id, s.stock
FROM item i
CROSS JOIN LATERAL (
SELECT ARRAY(
SELECT COALESCE(i_s.stock, 0)
FROM unnest('{1,2}'::int[]) n
LEFT JOIN item_stock i_s ON i_s.shop_id = n.n
AND i_s.item_id = i.id
ORDER BY n.n
) AS stock
) s;
Related:
And if you have more than just the two shops, a nested crosstab() (from the additional module tablefunc) should provide top performance:
SELECT i.id, COALESCE(stock, '{0,0}') AS stock
FROM item i
LEFT JOIN (
SELECT id, ARRAY[COALESCE(shop1, 0), COALESCE(shop2, 0)] AS stock
FROM crosstab(
$$SELECT item_id, shop_id, stock
FROM item_stock
WHERE shop_id = ANY ('{1,2}'::int[])
ORDER BY 1,2$$
, $$SELECT unnest('{1,2}'::int[])$$
) AS ct (id int, shop1 int, shop2 int)
) i_s USING (id);
Needs to be adapted in more places to cater for different shop IDs.
Related:
db<>fiddle here
Index
Make sure you have at least an index on item_stock (shop_id, item_id) - typically provided by a PRIMARY KEY on those columns. For the crosstab query, it also matters that shop_id comes first. See:
Adding stock as another index column may allow faster index-only scans. In Postgres 11 or later, consider adding it to the PK with an INCLUDE clause like so:
PRIMARY KEY (shop_id, item_id) INCLUDE (stock)
But only if you need it a lot, as it makes the index a bit bigger and possibly more susceptible to bloat from updates.
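Spelled out as a complete statement, as a sketch; it assumes the constraint carries the default name item_stock_pkey:

ALTER TABLE item_stock
  DROP CONSTRAINT item_stock_pkey                       -- drop the old PK ...
, ADD  PRIMARY KEY (shop_id, item_id) INCLUDE (stock);  -- ... and recreate it with INCLUDE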
Best Answer
This works as desired:
About LATERAL:

For big tables, it would be much more efficient to normalize your design with a separate ranges table (one range per row) instead of the ranges array. You can use a GiST index for that:

Solution for huge table
For a huge table like you mention in the comments (1 billion rows), I would consider a separate ranges table, optimized for size, and a BRIN index to go along with it. Assuming the numbers can be represented as bigint without loss, which is considerably cheaper to store. See below.

The BRIN operator class for range types shipped with Postgres 9.5 is range_inclusion_ops, which supports the overlaps operator &&.

To optimize disk space some more, I would just save two bigint numbers (your numeric values multiplied by 1000000) and make it a functional BRIN index. Basically like this:
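The following is a sketch; table and column names are my assumptions, as the original code is not included here:

CREATE TABLE ranges (
  id          int    NOT NULL
, range_start bigint NOT NULL  -- numeric lower bound * 1000000
, range_end   bigint NOT NULL  -- numeric upper bound * 1000000
);

-- functional BRIN index over the reconstructed range:
CREATE INDEX ranges_brin_idx ON ranges
USING brin (int8range(range_start, range_end, '[]'));

Query to get the same result as before, again as a sketch; 1.5 and 2.5 stand in for the actual numeric bounds:

SELECT id
FROM   ranges
WHERE  int8range(range_start, range_end, '[]')
    && int8range((1.5 * 1000000)::bigint, (2.5 * 1000000)::bigint, '[]');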
Storage size of bigint vs. numrange

A numrange over numbers with a precision of 15 occupies 32 bytes on disk, resulting in 64 bytes per row (plus int column, tuple header and item identifier). The same with two bigint columns (2 x 8 bytes) results in 52 bytes total. That makes the table around 12 GB smaller. Index size is the same. See for yourself:
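One quick way to compare is pg_column_size() on ad-hoc values (a standalone check of my own, not from the original):

SELECT pg_column_size(numrange(123456789.123456, 987654321.654321)) AS numrange_bytes
     , pg_column_size(123456789123456::bigint) * 2                  AS two_bigint_bytes;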
Detailed explanation for row size: