PostgreSQL – Combining separate columns into smallest possible related ranges

Tags: gaps-and-islands, postgresql, postgresql-9.6

I'm trying to discard multiple records that may or may not overlap, keeping only the record with the smallest range within each contiguous group. I thought of doing something similar to this, but the ranges are stored as numeric strings in separate columns, and the same query carries 4 more fields for which I only need the values of the record with the smallest range.

Data with simplified fields

    create table invoices(
        eventname varchar,
        /*...many fields*/
        quantity varchar,
        section varchar,
        rownumber varchar,
        secondrow varchar,
        lowseat varchar,
        highseat varchar,
        /*...some more fields*/
        status varchar,
        /*...even more fields*/
        created_at timestamp default now() not null,
        updated_at timestamp
    );

INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '2', '227', '15', null, '9', '10', 'DEPLETED' ,  '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '2', '227', '15', null, '7', '8', 'DEPLETED'  ,  '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '2', '227', '14', null, '23', '24', 'DEPLETED',  '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '1', '227', '13', null, '21', '21', 'DEPLETED',  '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '8', '227', '14', null, '15', '22', 'DEPLETED',  '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '1', '227', '14', null, '1', '1', 'DEPLETED',    '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '2', 'A57', 'GA', null, '1', '2', 'DEPLETED',    '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 4 (Miami vs North Carolina and Duke vs Notre Dame)', '3', 'A57', 'GA', null, '3', '5', 'DEPLETED',    '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('2018 ACC Basketball Tournament - Session 5 (Virginia vs. Clemson and Duke vs. North Carolina)', '3', '228', '14', null, '1', '3', 'DEPLETED', '2019-02-06 00:46:13.286828', null);
INSERT INTO public.invoices (eventname, quantity, section, rownumber, secondrow, lowseat, highseat, status, created_at, updated_at) VALUES ('Penn State Nittany Lions at Pittsburgh Panthers', '2', '227', 'K', null, '25', '26', 'DEPLETED', '2019-02-06 00:46:13.286828', null);

Visual representation:

Group 1

1 | =====
2 |   ===  --> take this record with all its values

Group 2

3 |    === --> take this record

Group 3

4 |       =======
5 |           ==  --> take this record
6 |         =====

Adjacent ranges should be merged.
Inclusive lower and upper bounds fit best for seat numbers.

I tried the following, but it returns the same values for every row, so I know it's not right:

SELECT distinct section, rownumber,
min(COALESCE(lowseat, '')) over 
(partition by grp) as lowseat,
max(maxhighseat) over (partition by grp) AS highseat
FROM  (
SELECT *, count(nextstart > maxhighseat OR NULL) OVER (PARTITION BY section,
rownumber ORDER BY lowseat desc, highseat desc NULLS LAST) AS grp
FROM  (
  SELECT section, rownumber, lowseat, highseat, max(COALESCE(highseat, '')) OVER (PARTITION BY section, rownumber ORDER BY lowseat, highseat) AS maxhighseat
       , lead(lowseat) OVER (PARTITION BY section, rownumber ORDER BY lowseat, highseat) As nextstart
  FROM invoices where status <> 'DEPLETED' and eventname like 'UCLA%'
  ) a
) b
ORDER  BY 1;

The important fields of the table look like this:

 id | section | row | lowseat | highseat | created_at
----+---------+-----+---------+----------+--------------------------
  1 |      14 |  18 |       1 |       15 | 2019-01-01T00:00:00.000Z
  2 |      14 |  18 |       4 |       15 | 2019-01-01T00:00:00.000Z
  3 |      12 |  13 |       2 |       13 | 2019-02-01T00:00:00.000Z
  4 |      14 |  18 |       4 |       12 | 2019-01-01T00:00:00.000Z

Best Answer

This is a classic gaps-and-islands question. The question itself still has a number of gaps, no pun intended. Filling in with some ...

Assumptions

lowseat and highseat seem to be the lower and upper bounds of your ranges: obviously integer numbers, but stored as varchar. Change that, or add type casts to the query below.
You did not define whether adjacent ranges should be merged or kept separate. Assuming separate, since those are not strictly "overlapping".
Assuming inclusive lower and upper bounds, which fits seat numbers best.
Ignoring query predicates that don't line up with the sample data.
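
Why the type matters: compared as text, seat numbers are ordered character by character, so '9' sorts after '10' and both the window ordering and the gap test go wrong. A minimal sketch of the two ways out, using a throwaway temp table (seats_demo is a made-up name standing in for invoices):

```sql
-- Text comparison goes character by character, so '9' sorts after '10'.
SELECT '9'::varchar > '10'::varchar AS text_cmp      -- true
     , 9 > 10                       AS integer_cmp;  -- false

-- Hypothetical demo table standing in for invoices.
CREATE TEMP TABLE seats_demo (lowseat varchar, highseat varchar);
INSERT INTO seats_demo VALUES ('9', '10'), ('7', '8'), ('15', '22');

-- Option 1: convert the columns once.
-- This fails if any stored value is not a valid integer.
ALTER TABLE seats_demo
   ALTER COLUMN lowseat  TYPE integer USING lowseat::integer
 , ALTER COLUMN highseat TYPE integer USING highseat::integer;

-- Option 2 (columns left as varchar): cast inside the query instead,
-- for example ORDER BY lowseat::integer, highseat::integer DESC.

-- After the conversion the sort order is numeric: 7, 9, 15.
SELECT lowseat, highseat FROM seats_demo ORDER BY lowseat, highseat DESC;
```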

Query

SELECT DISTINCT ON (island) *
FROM  (
   SELECT *
        , highseat - lowseat AS len -- off by 1, but irrelevant
        , count(gap) OVER (ORDER BY rn) AS island
   FROM  (
      SELECT *
           , (lowseat > max(highseat) OVER w) OR NULL AS gap
           , row_number() OVER w AS rn
      FROM   invoices
      WINDOW w AS (ORDER BY lowseat, highseat DESC  -- longest range 1st
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
      ) sub1
   ) sub2
ORDER  BY island, len, lowseat;   -- break ties by picking smallest numbers


This is based on lowseat and highseat, the rest of the row is just ballast.
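
The query above treats the whole table as a single line of seats. The question's own attempt partitions by section and rownumber, so a possible adaptation, assuming islands should be formed per section and row and the seat columns stay varchar, might look like the sketch below. The table invoices_demo, its rows (modelled on the small table in the question) and the view name are made up for illustration; the view only makes the result easy to inspect.

```sql
-- Hypothetical demo data modelled on the small table in the question.
CREATE TEMP TABLE invoices_demo (
   id        int
 , section   varchar
 , rownumber varchar
 , lowseat   varchar
 , highseat  varchar
);
INSERT INTO invoices_demo VALUES
   (1, '14', '18', '1', '15')
 , (2, '14', '18', '4', '15')
 , (3, '12', '13', '2', '13')
 , (4, '14', '18', '4', '12');

-- Same technique, but islands are formed per (section, rownumber)
-- and the varchar seat columns are cast to integer.
CREATE TEMP VIEW smallest_per_island AS
SELECT DISTINCT ON (section, rownumber, island)
       id, section, rownumber, lowseat, highseat
FROM  (
   SELECT *
        , count(gap) OVER (PARTITION BY section, rownumber ORDER BY rn) AS island
   FROM  (
      SELECT *
           , highseat::int - lowseat::int AS len
           , (lowseat::int > max(highseat::int) OVER w) OR NULL AS gap
           , row_number() OVER w AS rn
      FROM   invoices_demo
      WINDOW w AS (PARTITION BY section, rownumber
                   ORDER BY lowseat::int, highseat::int DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
      ) sub1
   ) sub2
ORDER  BY section, rownumber, island, len, lowseat::int;

-- Keeps id 3 (the only range in section 12, row 13)
-- and id 4 (the shortest of the three overlapping ranges in section 14, row 18).
SELECT * FROM smallest_per_island;
```

If events must not be mixed either, eventname would have to be added to both PARTITION BY lists, to DISTINCT ON and to the leading ORDER BY columns.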

Related answer with more explanation and an alternative procedural implementation:

Calculate Difference in Overlapping Time in PostgreSQL / SSRS

About DISTINCT ON:

Select first row in each GROUP BY group?

Related Solutions

Postgresql – Combining separate ranges into largest possible contiguous ranges

Assumptions / Clarifications

No need to differentiate between infinity and open upper bound (upper(range) IS NULL). (You can have it either way, but it's simpler this way.)

NULL vs. infinity in PostgreSQL range types

Since date is a discrete type, all ranges have default [) bounds. The manual:

The built-in range types int4range, int8range, and daterange all use a canonical form that includes the lower bound and excludes the upper bound; that is, [).
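
A quick way to see both points, canonical [) bounds and the difference between a NULL and an infinite upper bound, is to run a few constructor calls. This is only an illustrative sketch with made-up dates:

```sql
-- Discrete range types are normalized to [) on input.
SELECT daterange('2020-01-01', '2020-01-05', '[]') AS closed_input     -- [2020-01-01,2020-01-06)
     , int4range(1, 5, '(]')                       AS half_open_input; -- [2,6)

-- An omitted (NULL) upper bound and 'infinity' are stored differently,
-- which is why the queries below fold both into 'infinity' with COALESCE.
SELECT upper(daterange('2020-01-01', NULL))       AS upper_of_unbounded  -- NULL
     , upper(daterange('2020-01-01', 'infinity')) AS upper_of_infinite;  -- infinity
```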

For other types (like tsrange!) I would enforce the same if possible:

Preventing adjacent/overlapping entries with EXCLUDE in PostgreSQL

Solution with pure SQL

With CTEs for clarity:

WITH a AS (
   SELECT range
        , COALESCE(lower(range),'-infinity') AS startdate
        , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
   FROM   test
   )
, b AS (
   SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
   FROM   a
   )
, c AS (
   SELECT *, count(step) OVER (ORDER BY range) AS grp
   FROM   b
   )
SELECT daterange(min(startdate), max(enddate)) AS range
FROM   c
GROUP  BY grp
ORDER  BY 1;

Or the same with subqueries, which is faster but harder to read:

SELECT daterange(min(startdate), max(enddate)) AS range
FROM  (
   SELECT *, count(step) OVER (ORDER BY range) AS grp
   FROM  (
      SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
      FROM  (
         SELECT range
              , COALESCE(lower(range),'-infinity') AS startdate
              , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
         FROM   test
         ) a
      ) b
   ) c
GROUP  BY grp
ORDER  BY 1;

How?

a: While ordering by range, compute the running maximum of the upper bound (enddate) with a window function.
Replace NULL bounds (unbounded) with +/- infinity just to simplify (no special NULL cases).

b: In the same sort order, if the previous enddate is earlier than startdate we have a gap and start a new range (step).
(Remember, the upper bound is always excluded.)

c: Form groups (grp) by counting steps with another window function.

In the outer SELECT, build ranges from lower to upper bound in each group. Voilà.
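
To watch steps a, b and c at work, it can help to select the intermediate columns instead of aggregating. The sketch below does that on a small made-up table (test_demo and its five rows are assumptions for illustration; the answer's queries read from a table named test):

```sql
-- Hypothetical demo table with a single daterange column.
CREATE TEMP TABLE test_demo (range daterange);
INSERT INTO test_demo VALUES
   ('[2020-01-01,2020-01-05)')
 , ('[2020-01-03,2020-01-10)')   -- overlaps the first
 , ('[2020-01-10,2020-01-12)')   -- adjacent to the second: no gap
 , ('[2020-01-15,2020-01-20)')   -- gap before this one
 , ('[2020-02-01,)');            -- unbounded upper bound, gap before it

-- Same subqueries as above, but showing every row with its step and grp.
SELECT range, startdate, enddate, step
     , count(step) OVER (ORDER BY range) AS grp
FROM  (
   SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
   FROM  (
      SELECT range
           , COALESCE(lower(range), '-infinity') AS startdate
           , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
      FROM   test_demo
      ) a
   ) b
ORDER  BY range;
-- step is true only for the 4th and 5th row, so grp comes out as 0, 0, 0, 1, 2.
```

Grouping by grp then yields three merged ranges for this data: the first three rows collapse into one, the other two stay on their own.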

Or with one less subquery level, but flipping sort order:

SELECT daterange(min(COALESCE(lower(range), '-infinity')), max(enddate)) AS range
FROM  (
   SELECT *, count(nextstart > enddate OR NULL) OVER (ORDER BY range DESC NULLS LAST) AS grp
   FROM  (
      SELECT range
           , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
           , lead(lower(range)) OVER (ORDER BY range) As nextstart
      FROM   test
      ) a
   ) b
GROUP  BY grp
ORDER  BY 1;

Sort the window in the second step with ORDER BY range DESC NULLS LAST (with NULLS LAST) to get perfectly inverted sort order. This should be cheaper (easier to produce, matches the sort order of the suggested index perfectly) and accurate for corner cases with range IS NULL. See:

PostgreSQL sort by datetime asc, null first?

Procedural solution with plpgsql

Works for any table / column name, but only for type daterange.
Procedural solutions with loops are typically slower, but in this special case I expect the function to be substantially faster since it only needs a single sequential scan:

CREATE OR REPLACE FUNCTION f_range_agg(_tbl text, _col text)
  RETURNS SETOF daterange AS
$func$
DECLARE
   _lower     date;
   _upper     date;
   _enddate   date;
   _startdate date;
BEGIN
   FOR _lower, _upper IN EXECUTE
      format(
         $sql$
         SELECT COALESCE(lower(t.%2$I),'-infinity')  -- replace NULL with ...
              , COALESCE(upper(t.%2$I), 'infinity')  -- ... +/- infinity
         FROM   %1$I t
         ORDER  BY t.%2$I
         $sql$, _tbl, _col)
   LOOP
      IF _lower > _enddate THEN     -- return previous range
         RETURN NEXT daterange(_startdate, _enddate);
         SELECT _lower, _upper  INTO _startdate, _enddate;
   
      ELSIF _upper > _enddate THEN  -- expand range
         _enddate := _upper;
   
      -- do nothing if _upper <= _enddate (range already included) ...
   
      ELSIF _enddate IS NULL THEN   -- init 1st round
         SELECT _lower, _upper  INTO _startdate, _enddate;
      END IF;
   END LOOP;
   
   IF FOUND THEN                    -- return last row
      RETURN NEXT daterange(_startdate, _enddate);
   END IF;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_range_agg('test', 'range');  -- table and column name

The logic is similar to the SQL solutions, but we can make do with a single pass.


GROUP BY and aggregate sequential numeric values

The usual drill for handling user input in dynamic SQL:

SQL injection in Postgres functions vs prepared queries
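
The function above relies on the %I specifier of format() to quote the table and column names it receives. A small sketch of what that quoting does, with made-up input strings:

```sql
-- %I quotes its argument as an identifier, %L as a string literal.
SELECT format('SELECT * FROM %I', 'test')                  AS plain_identifier   -- SELECT * FROM test
     , format('SELECT * FROM %I', 'my table')              AS quoted_identifier  -- SELECT * FROM "my table"
     , format('SELECT * FROM %I', 'test; DROP TABLE x')    AS injection_attempt  -- SELECT * FROM "test; DROP TABLE x"
     , format('SELECT * FROM t WHERE c = %L', 'O''Reilly') AS quoted_literal;    -- SELECT * FROM t WHERE c = 'O''Reilly'
```

The injection attempt ends up as one (nonexistent) quoted table name, so the dynamic statement fails instead of running the smuggled command.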

Index

For each of these solutions a plain (default) btree index on range would be instrumental for performance in big tables:

CREATE INDEX foo ON test (range);

A btree index is of limited use for range types, but we can get pre-sorted data and maybe even an index-only scan.
