PostgreSQL replication: out-of-sequence timeline ID, when former master made slave

postgresqlreplication

I've configured master slave replication in PostgreSQL, following this tutorial:
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Binary_Replication_in_6_Steps

It worked fine. Now I've promoted the slave, and configured the original master to be a slave to the new master, again following the tutorial linked above.

However, replication no longer works. The new slave (i.e. the former master) logs errors like:

FATAL:  terminating walreceiver process due to administrator command
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0

or like:

LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0

What am I doing wrong?

Details follows

When I promoted the former slave, it seems to have started a new timeline, no 2:

LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command
LOG:  record with zero length at 0/18B5148
LOG:  redo done at 0/18B50F0
LOG:  last completed transaction was at log time 2011-12-05 08:39:43.872041+00
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections

There are 2 timelines in the new master's (i.e. the former slave's) pg_xlog directory:

new-master$ tree pg_xlog
    pg_xlog
    ├── 000000010000000000000001
    ├── 000000020000000000000001
    ├── 00000002.history
    └── archive_status
        └── 00000002.history.ready

I shutdown the new master and the new slave. On the new slave, I do this:

new-slave$ cd /var/lib/pgsql/9.1/data/
new-slave$ rm *
new-slave$ mkdir pg_xlog ; chmod 700 pg_xlog
new-slave$ rsync -a  --exclude pg_xlog --exclude postgresql.conf --exclude recovery.conf --exclude recovery.done --exclude postmaster.pid --exclude 'server.*'  \
     dw0azewdbpv11danny:/var/lib/pgsql/9.1/data/*  .
new-slave$ ... restore postgresql.conf etcetera from backup,
             ... change settings following instructions in tutorial

I start the new master and the new slave. But the slave says FATAL: terminating walreceiver process:

LOG:  database system was interrupted; last known up at 2011-12-06 21:49:49 UTC
LOG:  creating missing WAL directory "pg_xlog/archive_status"
LOG:  entering standby mode
LOG:  streaming replication successfully connected to primary
LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0
FATAL:  terminating walreceiver process due to administrator command
LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0
LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0
...

The pg_xlog on the new slave:

new-slave$ tree pg_xlog/
pg_xlog/
├── 000000020000000000000001
└── archive_status

Should I have rsynced pg_xlog too? I did that: shutdown master & slave, rsync pg_xlog, start slave, start master. Then the replication seems to work: the new slave says:

LOG:  database system was shut down at 2011-12-06 22:10:14 UTC
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/19A4420
LOG:  database system is ready to accept read only connections
LOG:  record with zero length at 0/19A4420
FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "dw0azewdbpv11danny" (46.137.XX.YY) and accepting
                TCP/IP connections on port 5432?

LOG:  streaming replication successfully connected to primary

But when I do anything at the new master:

insert into moo (mää) values (23);

Then replication fails. The new slave says:

LOG:  could not receive data from client: Connection reset by peer
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
FATAL:  terminating walreceiver process due to administrator command
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0

The master says:

LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  could not receive data from client: Connection reset by peer

But I didn't reset any connection!

If I restart the slave, it fails instantly: FATAL: terminating walreceiver process:

LOG:  database system was shut down in recovery at 2011-12-06 22:24:37 UTC
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/19A4420
LOG:  database system is ready to accept read only connections
LOG:  redo starts at 0/19A4420
LOG:  record with zero length at 0/19A4510
LOG:  streaming replication successfully connected to primary
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
FATAL:  terminating walreceiver process due to administrator command

Config files:

---- new master & slave: ----
wal_level = hot_standby
#archive_mode = off   (default)

---- new master: ----
max_wal_senders = 5
wal_keep_segments = 100

---- new slave: ----
hot_standby = on

---- recovery.conf, on the new slave: ----
standby_mode = 'on'
primary_conninfo = 'host=dw0azewdbpv11danny user=... password=...'

What am I doing wrong? How can I make the former master a slave?

(Is it better if I don't delete all files on the new slave before I rsync?
Or am I supposed to run initdb on the new slave before I rsync?
Do you know if there is any tutorial on how to convert a master to a slave?)

Kind regards, KajMagnus

Best Answer

I found a solution:

Add recovery_target_timeline = 'latest'
to recovery.conf on the new master (the former slave).

Now replication works fine: the new master chooses the correct timeline.

Here's some info on that parameter, and lots of other parameters in recovery.conf:
http://pgpool.projects.postgresql.org/pgpool-II/doc/recovery.conf.sample

Here's some more info on that parameter:
http://www.postgresql.org/docs/9.1/static/warm-standby.html#STANDBY-SERVER-SETUP