"Streaming replication" refers to continuous sending of WAL records over a TCP/IP connection between the master and the replica, using the walsender protocol over replication connections. The master reads its own WAL from pg_xlog and sends it to the replica on demand. It's configured with a primary_conninfo directive in recovery.conf, plus pg_hba.conf entries on the master to permit replication connections. You also need wal_keep_segments and some other options covered in the docs.
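As a rough sketch of a minimal streaming setup (hostnames, the user name, the password, the replica's IP, and the segment count below are all made up for illustration; check the docs for your version):

```
# postgresql.conf on the master (values illustrative)
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 64

# pg_hba.conf on the master -- allow the replica to connect for replication
host  replication  replicator  192.168.1.10/32  md5

# recovery.conf on the replica
standby_mode = 'on'
primary_conninfo = 'host=master.example.com user=replicator password=secret'
```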
"Log shipping" refers to periodic sending of WAL records as whole WAL archives via a file transfer protocol to an archive location, from which the replica can then fetch them. It's configured with a restore_command directive in recovery.conf and an archive_command on the master. PostgreSQL doesn't care where the files are or how they're transferred, only that the archive_command puts them there and the restore_command fetches the required archive; this permits the building of systems like PgBarman and WAL-E.
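A minimal log-shipping configuration might look like this (the shared path /mnt/archive and the use of cp are illustrative only; %f and %p are the standard placeholders PostgreSQL substitutes for the file name and path):

```
# postgresql.conf on the master
archive_mode = on
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'

# recovery.conf on the replica
standby_mode = 'on'
restore_command = 'cp /mnt/archive/%f %p'
```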
Streaming replication doesn't have as much lag, as records are sent as they are generated. However, it requires both master and replica to be online and able to communicate directly. It also requires the replica to keep up well enough that the master still has on-disk copies of the WAL the replica needs, and generally requires you to spend extra pg_xlog space on retaining extra WAL for the replica.
Log shipping replication has more lag, because the replica only sees WAL once a whole archive is sent. However, it can work even when the master and replica can't communicate directly over TCP/IP, by using a shared storage location. It continues to work even if the replica is down for a while, because the master discards WAL from pg_xlog only after archiving it, so the WAL is still in the archive and usable by the replica even though the master can no longer send it by streaming. Note that archive_command never gives up, so pg_xlog can fill up if archiving is failing; for that reason it's better to archive to a reliable location and then have the replica fetch from that location.
In general you actually combine the two, i.e. use both. In that case, streaming replication is used while everything is going fine. If the replica falls too far behind and the master has discarded xlogs it requires, or a connectivity problem arises, etc., then the replica switches to reading archived WAL until it has caught up. It periodically retries switching back to streaming until it succeeds.
If you're only going to use one, use log shipping, because streaming replication without log shipping fallback is (until PostgreSQL 9.4) potentially prone to replication lag causing failures that force a replica to be rebuilt.
PostgreSQL 9.4 changes this a bit, because streaming replication can now use "replication slots". These let the master keep track of how much WAL a replica needs and avoid throwing it away until the replica has replayed it. So there's no more need for wal_keep_segments if you use a replication slot (not the default).
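For example (the slot name replica1_slot below is arbitrary), you create a physical slot on the master and point the replica at it:

```sql
-- On the master: create a physical replication slot
SELECT pg_create_physical_replication_slot('replica1_slot');

-- In recovery.conf on the replica, reference that slot:
-- primary_slot_name = 'replica1_slot'
```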
See my article streaming replication slots in PostgreSQL 9.4.
9.4 also introduces the foundations for streaming logical replication, which is yet another mechanism, designed for use by logical replication systems like Londiste, Slony-I, and the new bi-directional async multi-master replication feature.
A couple of suggestions:
Use legal, lower-case, unquoted identifiers to save yourself a lot of confusion. end (and, to a lesser degree, start) are reserved words.
Your id column seems to be a numeric type like integer, not text.
Since you don't need a time component in start_day and end_day, the appropriate data type is date, not timestamp.
Why concatenate id and the year for a new "ID"? Add a second column year if you need it for a PK. Or don't add an additional column at all: it can cheaply be extracted from the new start_day on the fly. Generally, don't store data redundantly if you can avoid it.
Typically, timestamp ranges use an inclusive lower bound and an exclusive upper bound. Since timestamps can have fractional digits (up to 6 in Postgres), that is much cleaner. Your input 2003-12-31 23:59:59 would fail for 2003-12-31 23:59:59.123.
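To see why the exclusive upper bound is cleaner, compare the two tests (a quick sketch; the literals are from the example above):

```sql
-- An inclusive upper bound of 23:59:59 misses fractional seconds:
SELECT timestamp '2003-12-31 23:59:59.123' <= timestamp '2003-12-31 23:59:59';  -- false

-- An exclusive upper bound of the next day catches everything in 2003:
SELECT timestamp '2003-12-31 23:59:59.123' < timestamp '2004-01-01';            -- true
```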
So, your table could look like this:
CREATE TABLE ad (
id int PRIMARY KEY
, start_day date NOT NULL -- *inclusive* lower bound
, end_day date NOT NULL -- *exclusive* upper bound
CHECK (end_day > start_day) -- enforce legal input
);
Proper test values:
INSERT INTO ad(id, start_day, end_day)
VALUES
(1, '2003-06-07', '2004-10-01') -- span 2 years (your example)
, (2, '2003-06-07', '2003-06-08') -- 1 day in same year
, (3, '2003-06-07', '2003-10-01') -- span 1 year
, (4, '2003-06-07', '2006-10-01'); -- span many years
Solution
Use generate_series() in a LATERAL join, based on start and end day, truncated to the year with date_trunc(). This produces one row per year with the new start date. Add a year and you have the new end date. Except for the first and last row per id, where you substitute the proper start / end with GREATEST and LEAST respectively. Voilà.
-- CREATE TABLE ad_year AS
SELECT ad.id
, extract('year' FROM y)::int AS year
, GREATEST(y , ad.start_day) AS start_day
, LEAST (y + interval '1 year', ad.end_day) AS end_day
FROM ad
, generate_series(date_trunc('year', start_day::timestamp) -- cast to ts here!
, date_trunc('year', end_day::timestamp)
, interval '1 year') y;
Note that date_trunc() returns timestamptz for timestamptz input and timestamp for timestamp input. For date input it defaults to timestamptz. Since you seem to be ignoring time zones, cast the date to timestamp explicitly (start_day::timestamp).
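You can check the return types yourself with pg_typeof() (a quick sketch, consistent with the behavior described above):

```sql
SELECT pg_typeof(date_trunc('year', current_date));             -- timestamp with time zone
SELECT pg_typeof(date_trunc('year', current_date::timestamp));  -- timestamp without time zone
```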
Result:
id | year | start_day | end_day
----+------+---------------------+---------------------
1 | 2003 | 2003-06-07 00:00:00 | 2004-01-01 00:00:00
1 | 2004 | 2004-01-01 00:00:00 | 2004-10-01 00:00:00
2 | 2003 | 2003-06-07 00:00:00 | 2003-06-08 00:00:00
3 | 2003 | 2003-06-07 00:00:00 | 2003-10-01 00:00:00
4 | 2003 | 2003-06-07 00:00:00 | 2004-01-01 00:00:00
4 | 2004 | 2004-01-01 00:00:00 | 2005-01-01 00:00:00
4 | 2005 | 2005-01-01 00:00:00 | 2006-01-01 00:00:00
4 | 2006 | 2006-01-01 00:00:00 | 2006-10-01 00:00:00
If you create a new table from the result, I suggest (id, year) as primary key.
Aside: This is not an operator: ||'_'|| (nor a cute little face). It's two concatenation operators || and a string literal '_'.
Best Answer
As far as I understand it,

(DATE '2001-02-16', INTERVAL '100 days')

is evaluated as

(DATE '2001-02-16', DATE '2001-02-16' + INTERVAL '100 days')

which in turn is the same as

(DATE '2001-02-16', DATE '2001-05-27')

and that obviously doesn't overlap with

(DATE '2001-10-30', DATE '2002-10-30')
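Putting it together, both forms give the same answer (the literals are the ones from above):

```sql
SELECT (DATE '2001-02-16', INTERVAL '100 days')
       OVERLAPS (DATE '2001-10-30', DATE '2002-10-30');  -- false

SELECT (DATE '2001-02-16', DATE '2001-05-27')
       OVERLAPS (DATE '2001-10-30', DATE '2002-10-30');  -- false
```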