"Streaming replication" refers to continuous sending of WAL records over a TCP/IP connection between the master and the replica, using the walsender protocol over replication connections. The master reads its own WAL from pg_xlog and sends it to the replica on demand. It's configured with a primary_conninfo directive in recovery.conf, plus pg_hba.conf entries on the master to permit replication connections. You also need wal_keep_segments and some other options covered in the docs.
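As a rough sketch of a minimal streaming setup (hostnames, the user name, the password, the replica's IP, and the segment count below are all made up for illustration; check the docs for your version):

```
# postgresql.conf on the master (values illustrative)
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 64

# pg_hba.conf on the master -- allow the replica to connect for replication
host  replication  replicator  192.168.1.10/32  md5

# recovery.conf on the replica
standby_mode = 'on'
primary_conninfo = 'host=master.example.com user=replicator password=secret'
```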
"Log shipping" refers to periodic sending of WAL records as whole WAL archives via a file transfer protocol to an archive location, from which the replica can then fetch them. It's configured with a restore_command directive in recovery.conf and an archive_command on the master. PostgreSQL doesn't care where the files are or how they're transferred, only that the archive_command puts them there and the restore_command fetches the required archive; this permits the building of systems like PgBarman and WAL-E.
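A minimal log-shipping configuration might look like this (the shared path /mnt/archive and the use of cp are illustrative only; %f and %p are the standard placeholders PostgreSQL substitutes for the file name and path):

```
# postgresql.conf on the master
archive_mode = on
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'

# recovery.conf on the replica
standby_mode = 'on'
restore_command = 'cp /mnt/archive/%f %p'
```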
Streaming replication doesn't have as much lag, as records are sent as they are generated. However, it requires both master and replica to be online and able to communicate directly. It also requires the replica to keep up well enough that the master still has on-disk copies of the WAL the replica needs, and generally requires you to spend extra pg_xlog space on retaining extra WAL for the replica.
Log shipping replication has more lag, because the replica only sees WAL once a whole archive is sent. However, it can work even when the master and replica can't communicate directly over TCP/IP, by using a shared storage location. It continues to work even if the replica is down for a while, because the master discards WAL from pg_xlog only after archiving it, so the WAL is still in the archive and usable by the replica even though the master can no longer send it by streaming. Note that archive_command never gives up, so pg_xlog can fill up if archiving is failing; for that reason it's better to archive to a reliable location and then have the replica fetch from that location.
In general you actually combine the two, i.e. use both. In that case, streaming replication is used while everything is going fine. If the replica falls too far behind and the master has discarded xlogs it requires, or a connectivity problem arises, etc., then the replica switches to reading archived WAL until it has caught up. It periodically retries switching back to streaming until it succeeds.
If you're only going to use one, use log shipping, because streaming replication without log shipping fallback is (until PostgreSQL 9.4) potentially prone to replication lag causing failures that force a replica to be rebuilt.
PostgreSQL 9.4 changes this a bit, because streaming replication can now use "replication slots". These let the master keep track of how much WAL a replica needs and avoid throwing it away until the replica has replayed it. So there's no more need for wal_keep_segments if you use a replication slot (not the default).
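For example (the slot name replica1_slot below is arbitrary), you create a physical slot on the master and point the replica at it:

```sql
-- On the master: create a physical replication slot
SELECT pg_create_physical_replication_slot('replica1_slot');

-- In recovery.conf on the replica, reference that slot:
-- primary_slot_name = 'replica1_slot'
```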
See my article streaming replication slots in PostgreSQL 9.4.
9.4 also introduces the foundations for streaming logical replication, which is yet another mechanism, designed for use by logical replication systems like Londiste, Slony-I, and the new bi-directional async multi-master replication feature.
A couple of suggestions:
Use legal, lower-case, unquoted identifiers to save yourself a lot of confusion. end (and, to a lesser degree, start) are reserved words.
Your id column seems to be a numeric type like integer, not text.
Since you don't need a time component in start_day and end_day, the appropriate data type is date, not timestamp.
Why concatenate id and the year for a new "ID"? Add a second column year if you need it for a PK. Or don't add an additional column at all: it can cheaply be extracted from the new start_day on the fly. Generally, don't store data redundantly if you can avoid it.
Typically, timestamp ranges use an inclusive lower bound and an exclusive upper bound. Since timestamps can have fractional digits (up to 6 in Postgres), that is much cleaner. Your input 2003-12-31 23:59:59 would fail for 2003-12-31 23:59:59.123.
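To see why the exclusive upper bound is cleaner, compare the two tests (a quick sketch; the literals are from the example above):

```sql
-- An inclusive upper bound of 23:59:59 misses fractional seconds:
SELECT timestamp '2003-12-31 23:59:59.123' <= timestamp '2003-12-31 23:59:59';  -- false

-- An exclusive upper bound of the next day catches everything in 2003:
SELECT timestamp '2003-12-31 23:59:59.123' < timestamp '2004-01-01';            -- true
```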
So, your table could look like this:
CREATE TABLE ad (
id int PRIMARY KEY
, start_day date NOT NULL -- *inclusive* lower bound
, end_day date NOT NULL -- *exclusive* upper bound
CHECK (end_day > start_day) -- enforce legal input
);
Proper test values:
INSERT INTO ad(id, start_day, end_day)
VALUES
(1, '2003-06-07', '2004-10-01') -- span 2 years (your example)
, (2, '2003-06-07', '2003-06-08') -- 1 day in same year
, (3, '2003-06-07', '2003-10-01') -- span 1 year
, (4, '2003-06-07', '2006-10-01'); -- span many years
Solution
Use generate_series() in a LATERAL join, based on start and end day, truncated to the year with date_trunc(). This produces one row per year with the new start date. Add a year and you have the new end date. Except for the first and last row per id, where you substitute the proper start / end with GREATEST and LEAST respectively. Voilà.
-- CREATE TABLE ad_year AS
SELECT ad.id
, extract('year' FROM y)::int AS year
, GREATEST(y , ad.start_day) AS start_day
, LEAST (y + interval '1 year', ad.end_day) AS end_day
FROM ad
, generate_series(date_trunc('year', start_day::timestamp) -- cast to ts here!
, date_trunc('year', end_day::timestamp)
, interval '1 year') y;
Note that date_trunc() returns timestamptz for timestamptz input and timestamp for timestamp input. For date input it defaults to timestamptz. Since you seem to be ignoring time zones, cast the date to timestamp explicitly (start_day::timestamp).
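You can check the return types yourself with pg_typeof() (a quick sketch, consistent with the behavior described above):

```sql
SELECT pg_typeof(date_trunc('year', current_date));             -- timestamp with time zone
SELECT pg_typeof(date_trunc('year', current_date::timestamp));  -- timestamp without time zone
```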
Result:
id | year | start_day | end_day
----+------+---------------------+---------------------
1 | 2003 | 2003-06-07 00:00:00 | 2004-01-01 00:00:00
1 | 2004 | 2004-01-01 00:00:00 | 2004-10-01 00:00:00
2 | 2003 | 2003-06-07 00:00:00 | 2003-06-08 00:00:00
3 | 2003 | 2003-06-07 00:00:00 | 2003-10-01 00:00:00
4 | 2003 | 2003-06-07 00:00:00 | 2004-01-01 00:00:00
4 | 2004 | 2004-01-01 00:00:00 | 2005-01-01 00:00:00
4 | 2005 | 2005-01-01 00:00:00 | 2006-01-01 00:00:00
4 | 2006 | 2006-01-01 00:00:00 | 2006-10-01 00:00:00
If you create a new table from the result, I suggest (id, year) as primary key.
Aside: This is not an operator: ||'_'|| (nor a cute little face). It's two concatenation operators || and a string literal '_'.
Best Answer
As far as I understand it,

(DATE '2001-02-16', INTERVAL '100 days')

is evaluated as

(DATE '2001-02-16', DATE '2001-02-16' + INTERVAL '100 days')

which in turn is the same as

(DATE '2001-02-16', DATE '2001-05-27')

and that obviously doesn't overlap with

(DATE '2001-10-30', DATE '2002-10-30')
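Putting it together, both forms give the same answer (the literals are the ones from above):

```sql
SELECT (DATE '2001-02-16', INTERVAL '100 days')
       OVERLAPS (DATE '2001-10-30', DATE '2002-10-30');  -- false

SELECT (DATE '2001-02-16', DATE '2001-05-27')
       OVERLAPS (DATE '2001-10-30', DATE '2002-10-30');  -- false
```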