PostgreSQL – Combining Separate Ranges into Contiguous Ranges

aggregate, postgresql, range-types

I'm trying to combine multiple date ranges that may or may not overlap into the largest possible contiguous date ranges. My load is at most about 500 ranges, and around 10 in most cases. For example:

Data:

CREATE TABLE test (
  id SERIAL PRIMARY KEY NOT NULL,
  range DATERANGE
);

INSERT INTO test (range) VALUES 
  (DATERANGE('2015-01-01', '2015-01-05')),
  (DATERANGE('2015-01-01', '2015-01-03')),
  (DATERANGE('2015-01-03', '2015-01-06')),
  (DATERANGE('2015-01-07', '2015-01-09')),
  (DATERANGE('2015-01-08', '2015-01-09')),
  (DATERANGE('2015-01-12', NULL)),
  (DATERANGE('2015-01-10', '2015-01-12')),
  (DATERANGE('2015-01-10', '2015-01-12'));

Table looks like:

 id |          range
----+-------------------------
  1 | [2015-01-01,2015-01-05)
  2 | [2015-01-01,2015-01-03)
  3 | [2015-01-03,2015-01-06)
  4 | [2015-01-07,2015-01-09)
  5 | [2015-01-08,2015-01-09)
  6 | [2015-01-12,)
  7 | [2015-01-10,2015-01-12)
  8 | [2015-01-10,2015-01-12)
(8 rows)

Desired results:

         combined
--------------------------
 [2015-01-01, 2015-01-06)
 [2015-01-07, 2015-01-09)
 [2015-01-10, )

Visual representation:

1 | =====
2 | ===
3 |    ===
4 |        ==
5 |         =
6 |             =============>
7 |           ==
8 |           ==
--+---------------------------
  | ====== == ===============>

Best Answer

Assumptions / Clarifications

No need to differentiate between infinity and open upper bound (upper(range) IS NULL). (You can have it either way, but it's simpler this way.)

NULL vs. infinity in PostgreSQL range types
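To make the distinction that is being set aside here concrete, the following is a small sketch using the built-in functions upper() and upper_inf(). An omitted (NULL) upper bound makes the range unbounded, while an explicit 'infinity' bound is an ordinary value of type date. The column aliases are made up for illustration:

```sql
-- Compare an open upper bound (NULL) with an explicit 'infinity' bound.
SELECT upper(daterange('2015-01-12', NULL)) IS NULL    AS upper_is_null   -- true
     , upper_inf(daterange('2015-01-12', NULL))        AS null_unbounded  -- true
     , upper_inf(daterange('2015-01-12', 'infinity'))  AS inf_unbounded   -- false
     , upper(daterange('2015-01-12', 'infinity'))      AS inf_upper;      -- infinity
```

The queries below replace NULL bounds with -infinity / infinity, so the combined result reports [2015-01-10,infinity) rather than [2015-01-10,).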

Since date is a discrete type, all ranges have default [) bounds. The manual:

The built-in range types int4range, int8range, and daterange all use a canonical form that includes the lower bound and excludes the upper bound; that is, [).
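As a quick illustration of the canonical form quoted above (a sketch, with made-up column aliases): a daterange entered with an inclusive upper bound is stored and displayed with an exclusive upper bound one day later.

```sql
-- Discrete range types are normalized to [) on input.
SELECT daterange('2015-01-01', '2015-01-05', '[]') AS inclusive_input  -- [2015-01-01,2015-01-06)
     , daterange('2015-01-01', '2015-01-06')       AS default_bounds   -- [2015-01-01,2015-01-06)
     , daterange('2015-01-01', '2015-01-05', '[]')
     = daterange('2015-01-01', '2015-01-06')       AS same_range;      -- true
```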

For other types (like tsrange!) I would enforce the same if possible:

Preventing adjacent/overlapping entries with EXCLUDE in PostgreSQL

Solution with pure SQL

With CTEs for clarity:

WITH a AS (
   SELECT range
        , COALESCE(lower(range),'-infinity') AS startdate
        , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
   FROM   test
   )
, b AS (
   SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
   FROM   a
   )
, c AS (
   SELECT *, count(step) OVER (ORDER BY range) AS grp
   FROM   b
   )
SELECT daterange(min(startdate), max(enddate)) AS range
FROM   c
GROUP  BY grp
ORDER  BY 1;

Or the same with subqueries, which is faster but harder to read:

SELECT daterange(min(startdate), max(enddate)) AS range
FROM  (
   SELECT *, count(step) OVER (ORDER BY range) AS grp
   FROM  (
      SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
      FROM  (
         SELECT range
              , COALESCE(lower(range),'-infinity') AS startdate
              , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
         FROM   test
         ) a
      ) b
   ) c
GROUP  BY grp
ORDER  BY 1;

How?

a: While ordering by range, compute the running maximum of the upper bound (enddate) with a window function.
Replace NULL bounds (unbounded) with +/- infinity just to simplify (no special NULL cases).

b: In the same sort order, if the previous enddate is earlier than startdate we have a gap and start a new range (step).
(Remember, the upper bound is always excluded.)

c: Form groups (grp) by counting steps with another window function.

In the outer SELECT build ranges from lower to upper bound in each group. Voilà.
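To follow the three steps on the sample data, it can help to look at the intermediate columns before grouping. This is only a sketch: it reuses the CTEs a, b and c against the test table from the question and selects the helper columns instead of aggregating.

```sql
WITH a AS (
   SELECT range
        , COALESCE(lower(range), '-infinity') AS startdate
        , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate  -- running maximum
   FROM   test
   )
, b AS (
   SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step  -- true on a gap, else NULL
   FROM   a
   )
, c AS (
   SELECT *, count(step) OVER (ORDER BY range) AS grp  -- number of gaps seen so far
   FROM   b
   )
SELECT range, startdate, enddate, step, grp
FROM   c
ORDER  BY range;
```

With the eight sample rows, step should be true only for [2015-01-07,2015-01-09) and for one of the two identical [2015-01-10,2015-01-12) rows, so grp comes out as 0, 0, 0, 1, 1, 2, 2, 2. The two identical rows are peers in the window's sort order, which is why both get the same count.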

Or with one less subquery level, but flipping sort order:

SELECT daterange(min(COALESCE(lower(range), '-infinity')), max(enddate)) AS range
FROM  (
   SELECT *, count(nextstart > enddate OR NULL) OVER (ORDER BY range DESC NULLS LAST) AS grp
   FROM  (
      SELECT range
           , max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
           , lead(lower(range)) OVER (ORDER BY range) AS nextstart
      FROM   test
      ) a
   ) b
GROUP  BY grp
ORDER  BY 1;

Sort the window in the second step with ORDER BY range DESC NULLS LAST to get perfectly inverted sort order. This should be cheaper (easier to produce, matches the sort order of the suggested index perfectly) and accurate for corner cases with range IS NULL. See:

PostgreSQL sort by datetime asc, null first?

Procedural solution with plpgsql

Works for any table / column name, but only for type daterange.
Procedural solutions with loops are typically slower, but in this special case I expect the function to be substantially faster since it only needs a single sequential scan:

CREATE OR REPLACE FUNCTION f_range_agg(_tbl text, _col text)
  RETURNS SETOF daterange AS
$func$
DECLARE
   _lower     date;
   _upper     date;
   _enddate   date;
   _startdate date;
BEGIN
   FOR _lower, _upper IN EXECUTE
      format(
         $sql$
         SELECT COALESCE(lower(t.%2$I),'-infinity')  -- replace NULL with ...
              , COALESCE(upper(t.%2$I), 'infinity')  -- ... +/- infinity
         FROM   %1$I t
         ORDER  BY t.%2$I
         $sql$, _tbl, _col)
   LOOP
      IF _lower > _enddate THEN     -- return previous range
         RETURN NEXT daterange(_startdate, _enddate);
         SELECT _lower, _upper  INTO _startdate, _enddate;
   
      ELSIF _upper > _enddate THEN  -- expand range
         _enddate := _upper;
   
      -- do nothing if _upper <= _enddate (range already included) ...
   
      ELSIF _enddate IS NULL THEN   -- init 1st round
         SELECT _lower, _upper  INTO _startdate, _enddate;
      END IF;
   END LOOP;
   
   IF FOUND THEN                    -- return last row
      RETURN NEXT daterange(_startdate, _enddate);
   END IF;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_range_agg('test', 'range');  -- table and column name

The logic is similar to the SQL solutions, but we can make do with a single pass.

SQL Fiddle.

GROUP BY and aggregate sequential numeric values

The usual drill for handling user input in dynamic SQL:

SQL injection in Postgres functions vs prepared queries
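As a small sketch of why the function above builds its query with format() and the %I specifier: identifiers passed in as text are quoted where necessary, so a malicious "column name" ends up as one (nonexistent) identifier instead of executable SQL. The second pair of names is made up for illustration:

```sql
-- %I escapes its argument as an SQL identifier; n$ picks the n-th argument.
SELECT format('SELECT t.%2$I FROM %1$I t', 'test', 'id') AS harmless
       -- SELECT t.id FROM test t
     , format('SELECT t.%2$I FROM %1$I t', 'My Table', 'x; DROP TABLE test') AS defused;
       -- SELECT t."x; DROP TABLE test" FROM "My Table" t
```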

Index

For each of these solutions a plain (default) btree index on range would be instrumental for performance in big tables:

CREATE INDEX foo ON test (range);

A btree index is of limited use for range types, but we can get pre-sorted data and maybe even an index-only scan.
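One way to check whether the planner actually uses the index for pre-sorted input is to look at the plan. This is only a sketch, assuming the index foo from above exists. On a tiny table like the sample, the planner will probably still prefer a sequential scan plus sort, and an index-only scan additionally depends on the visibility map being up to date, hence the VACUUM.

```sql
VACUUM ANALYZE test;   -- refresh statistics and the visibility map

EXPLAIN
SELECT range
FROM   test
ORDER  BY range;       -- ideally: Index Only Scan using foo on test, without a Sort node
```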

Notes

The current version of the 9.3 major release is 9.3.6. The project recommends that ...

all users run the latest available minor release for whatever major version is in use.
A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too.
It might pay to add the otherwise irrelevant price as the last item of the index to get index-only scans out of this. You'll have to test.
Since you have concurrent deletes it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)
I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you, this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.
sku is unique per product, so we can PARTITION BY it.
ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.
To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.
If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:
```
LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
```
Do that only if it's necessary. It blocks all concurrent write access.
I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.

Access SQL DateTime Stamps – Flanking DateTime Stamps in TableA and TableB

I must first compliment you on your courage to do something like this with an Access DB, in which, in my experience, it is very difficult to do anything SQL-like. Anyway, on to the review.

First join

Your IIF field selections might benefit from using a Switch statement instead. It seems to be sometimes the case, especially with things SQL, that a SWITCH (more commonly known as CASE in typical SQL) is quite fast when just making simple comparisons in the body of a SELECT. The syntax in your case would be almost identical, although a switch can be expanded to cover a large chunk of comparisons in one field. Something to consider.

  SWITCH (
    expr1, val1,
    expr2, val2,
    True, val3  -- default value or "else": Switch only takes expr/value pairs
  )

A switch can also help readability, in larger statements. In context:

  MAX(SWITCH(B.XTStamp <= A.RecTStamp, B.XTStamp, True, Null)) as BeforeXTStamp,
  --alternatively MAX(-(B.XTStamp<=A.RecTStamp)*B.XTStamp) as BeforeXTStamp,
  MIN(SWITCH(B.XTStamp > A.RecTStamp, B.XTStamp, True, Null)) as AfterXTStamp

As for the join itself, I think (A.RecTStamp<>B.XTStamp OR A.RecTStamp=B.XTStamp) is about as good as you're going to get, given what you are trying to do. It's not that fast, but I wouldn't expect it to be either.

Second join

You said this is slower. It's also less readable from a code standpoint. Given equally satisfactory result sets between 1 and 2, I'd say go for 1. At least it's obvious what you are trying to do that way. Subqueries are often not very fast (though often unavoidable), especially in this case, where you are throwing in an extra join in each, which must certainly complicate the execution plan.

One remark: I saw that you used the old ANSI-89 join syntax. It's best to avoid that. Performance will be the same or better with the more modern join syntax, which is also less ambiguous, easier to read, and harder to get wrong.

FROM (FirstTable AS A INNER JOIN 
  (select top 1 B1.XTStamp, A1.RecTStamp 
   from SecondTable as B1
   inner join FirstTable as A1
     on B1.XTStamp <= A1.RecTStamp
   order by B1.XTStamp DESC) AS AbyB1 --MAX (time points before)

Naming things

I think the way your things are named is unhelpful at best, and cryptic at worst. A, B, A1, B1 etc. as table aliases I think could be better. Also, I think the field names are not very good, but I realize you may not have control over this. I will just quickly quote The Codeless Code on the topic of naming things, and leave it at that...

“Invective!” answered the priestess. “Verb your expletive nouns!”

"Next steps" query

I couldn't make much sense of it as it was written, so I had to take it to a text editor and make some style changes to make it more readable. I know Access' SQL editor is beyond clunky, so I usually write my queries in a good editor like Notepad++ or Sublime Text. Some of the stylistic changes I applied to make it more readable:

4 spaces indent instead of 2 spaces
Spaces around mathematical and comparison operators
More natural placing of braces and indentation (I went with Java-style braces, but could also be C-style, at your preference)

So as it turns out, this is a very complicated query indeed. To make sense of it, I have to start from the innermost query, your ID data set, which I understand is the same as your First Join. It returns the IDs and timestamps of the devices where the before/after timestamps are the closest, within the subset of devices you are interested in. So instead of ID, why not call it ClosestTimestampID?

Your Det join is used only once:

The rest of the time, it only joins the values you already have from ClosestTimestampID. So instead we should be able to just do this:

    ) AS ClosestTimestampID
    INNER JOIN SecondTable AS TL1 
        ON ClosestTimestampID.BeforeXTStamp = TL1.XTStamp) 
    INNER JOIN SecondTable AS TL2 
        ON ClosestTimestampID.AfterXTStamp = TL2.XTStamp
    WHERE ClosestTimestampID.XmitID IN (<limited subset S>)

It may not be a huge performance gain, but anything we can do to help the poor Jet DB optimizer will help!

I can't shake the feeling that the calculations/algorithm for BeforeWeight and AfterWeight which you use to interpolate could be done better, but unfortunately I'm not very good with those.

One suggestion to avoid crashing (although it's not ideal depending on your application) would be to break out your nested subqueries into tables of their own and update those when needed. I'm not sure how often you need your source data to be refreshed, but if it is not too often you might think of writing some VBA code to schedule an update of the tables and derived tables, and just leave your outermost query to pull from those tables instead of the original source. Just a thought, like I said not ideal but given the tool you may not have a choice.

Everything together:

SELECT
    InGPS.XmitID,
    StrDateIso8601Msec(InGPS.RecTStamp) AS RecTStamp_ms,
       -- StrDateIso8601MSec is a VBA function returning a TEXT string in yyyy-mm-dd hh:nn:ss.lll format
    InGPS.ReceivID,
    RD.Receiver_Location_Description,
    RD.Lat AS Receiver_Lat,
    RD.Lon AS Receiver_Lon,
    InGPS.Before_Lat * InGPS.BeforeWeight + InGPS.After_Lat * InGPS.AfterWeight AS Xmit_Lat,
    InGPS.Before_Lon * InGPS.BeforeWeight + InGPS.After_Lon * InGPS.AfterWeight AS Xmit_Lon,
    InGPS.RecTStamp AS RecTStamp_basic
FROM (
    SELECT 
        ClosestTimestampID.RecTStamp,
        ClosestTimestampID.XmitID,
        ClosestTimestampID.ReceivID,
        ClosestTimestampID.BeforeXTStamp, 
        TL1.Latitude AS Before_Lat, 
        TL1.Longitude AS Before_Lon,
        (1 - ((ClosestTimestampID.RecTStamp - ClosestTimestampID.BeforeXTStamp) 
            / (ClosestTimestampID.AfterXTStamp - ClosestTimestampID.BeforeXTStamp))) AS BeforeWeight,
        ClosestTimestampID.AfterXTStamp, 
        TL2.Latitude AS After_Lat, 
        TL2.Longitude AS After_Lon,
        (     (ClosestTimestampID.RecTStamp - ClosestTimestampID.BeforeXTStamp) 
            / (ClosestTimestampID.AfterXTStamp - ClosestTimestampID.BeforeXTStamp)) AS AfterWeight
        FROM (((
            SELECT 
                A.RecTStamp, 
                A.ReceivID, 
                A.XmitID,
                MAX(SWITCH(B.XTStamp <= A.RecTStamp, B.XTStamp, True, Null)) AS BeforeXTStamp,
                MIN(SWITCH(B.XTStamp > A.RecTStamp, B.XTStamp, True, Null)) AS AfterXTStamp
            FROM FirstTable AS A
            INNER JOIN SecondTable AS B 
                ON (A.RecTStamp <> B.XTStamp OR A.RecTStamp = B.XTStamp)
            WHERE A.XmitID IN (<limited subset S>)
            GROUP BY A.RecTStamp, ReceivID, XmitID
        ) AS ClosestTimestampID
        INNER JOIN FirstTable AS Det 
            ON (Det.XmitID = ClosestTimestampID.XmitID) 
            AND (Det.ReceivID = ClosestTimestampID.ReceivID) 
            AND (Det.RecTStamp = ClosestTimestampID.RecTStamp)) 
        INNER JOIN SecondTable AS TL1 
            ON ClosestTimestampID.BeforeXTStamp = TL1.XTStamp) 
        INNER JOIN SecondTable AS TL2 
            ON ClosestTimestampID.AfterXTStamp = TL2.XTStamp
        WHERE Det.XmitID IN (<limited subset S>)
    ) AS InGPS
INNER JOIN ReceiverDetails AS RD 
    ON (InGPS.ReceivID = RD.ReceivID) 
    AND (InGPS.RecTStamp BETWEEN <valid parameters from another table>)
ORDER BY StrDateIso8601Msec(InGPS.RecTStamp), InGPS.ReceivID;