PostgreSQL replication: out-of-sequence timeline ID, when former master made slave

postgresqlreplication

I've configured master slave replication in PostgreSQL, following this tutorial:
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#Binary_Replication_in_6_Steps

It worked fine. Now I've promoted the slave, and configured the original master to be a slave to the new master, again following the tutorial linked above.

However, replication no longer works. The new slave (i.e. the former master) logs errors like:

FATAL:  terminating walreceiver process due to administrator command
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0

or like:

LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0

What am I doing wrong?

Details follows

When I promoted the former slave, it seems to have started a new timeline, no 2:

LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command
LOG:  record with zero length at 0/18B5148
LOG:  redo done at 0/18B50F0
LOG:  last completed transaction was at log time 2011-12-05 08:39:43.872041+00
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections

There are 2 timelines in the new master's (i.e. the former slave's) pg_xlog directory:

new-master$ tree pg_xlog
    pg_xlog
    ├── 000000010000000000000001
    ├── 000000020000000000000001
    ├── 00000002.history
    └── archive_status
        └── 00000002.history.ready

I shutdown the new master and the new slave. On the new slave, I do this:

new-slave$ cd /var/lib/pgsql/9.1/data/
new-slave$ rm *
new-slave$ mkdir pg_xlog ; chmod 700 pg_xlog
new-slave$ rsync -a  --exclude pg_xlog --exclude postgresql.conf --exclude recovery.conf --exclude recovery.done --exclude postmaster.pid --exclude 'server.*'  \
     dw0azewdbpv11danny:/var/lib/pgsql/9.1/data/*  .
new-slave$ ... restore postgresql.conf etcetera from backup,
             ... change settings following instructions in tutorial

I start the new master and the new slave. But the slave says FATAL: terminating walreceiver process:

LOG:  database system was interrupted; last known up at 2011-12-06 21:49:49 UTC
LOG:  creating missing WAL directory "pg_xlog/archive_status"
LOG:  entering standby mode
LOG:  streaming replication successfully connected to primary
LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0
FATAL:  terminating walreceiver process due to administrator command
LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0
LOG:  unexpected timeline ID 1 in log file 0, segment 1, offset 0
...

The pg_xlog on the new slave:

new-slave$ tree pg_xlog/
pg_xlog/
├── 000000020000000000000001
└── archive_status

Should I have rsynced pg_xlog too? I did that: shutdown master & slave, rsync pg_xlog, start slave, start master. Then the replication seems to work: the new slave says:

LOG:  database system was shut down at 2011-12-06 22:10:14 UTC
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/19A4420
LOG:  database system is ready to accept read only connections
LOG:  record with zero length at 0/19A4420
FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "dw0azewdbpv11danny" (46.137.XX.YY) and accepting
                TCP/IP connections on port 5432?

LOG:  streaming replication successfully connected to primary

But when I do anything at the new master:

insert into moo (mää) values (23);

Then replication fails. The new slave says:

LOG:  could not receive data from client: Connection reset by peer
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
FATAL:  terminating walreceiver process due to administrator command
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0

The master says:

LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  could not receive data from client: Connection reset by peer

But I didn't reset any connection!

If I restart the slave, it fails instantly: FATAL: terminating walreceiver process:

LOG:  database system was shut down in recovery at 2011-12-06 22:24:37 UTC
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/19A4420
LOG:  database system is ready to accept read only connections
LOG:  redo starts at 0/19A4420
LOG:  record with zero length at 0/19A4510
LOG:  streaming replication successfully connected to primary
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
FATAL:  terminating walreceiver process due to administrator command

Config files:

---- new master & slave: ----
wal_level = hot_standby
#archive_mode = off   (default)

---- new master: ----
max_wal_senders = 5
wal_keep_segments = 100

---- new slave: ----
hot_standby = on

---- recovery.conf, on the new slave: ----
standby_mode = 'on'
primary_conninfo = 'host=dw0azewdbpv11danny user=... password=...'

What am I doing wrong? How can I make the former master a slave?

(Is it better if I don't delete all files on the new slave before I rsync?
Or am I supposed to run initdb on the new slave before I rsync?
Do you know if there is any tutorial on how to convert a master to a slave?)

Kind regards, KajMagnus

Best Answer

I found a solution:

Add recovery_target_timeline = 'latest'
to recovery.conf on the new master (the former slave).

Now replication works fine: the new master chooses the correct timeline.

Here's some info on that parameter, and lots of other parameters in recovery.conf:
http://pgpool.projects.postgresql.org/pgpool-II/doc/recovery.conf.sample

Here's some more info on that parameter:
http://www.postgresql.org/docs/9.1/static/warm-standby.html#STANDBY-SERVER-SETUP

Related Solutions

PostgreSQL streaming replication broken: invalid magic number, out-of-sequence timeline ID

This doesn't sound like the same thing. The GCC version is different and your PostgreSQL version is different. IIRC it was an issue with GCC 4.6 optimizations and was fixed. I don't believe it should have impacted your setup.

More likely something changed which caused this. Perhaps the slave was promoted to a master briefly and then demoted again? That will create problems like this all the time.

The timeline issue suggests to me that this is in fact what happened. It doesn't matter whether you write to a db or not, just completing recovery and promoting is enough to break the timeline.

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

The message "The database system is starting up." does not indicate an error. The reason it is at the FATAL level is so that it will always make it to the log, regardless of the setting of log_min_messages:

http://www.postgresql.org/docs/9.1/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHEN

After the rsync, did you really run what you show?:

pgsql -c "select pg_stop_backup();";

Since there is, so far as I know, no pgsql executable, that would leave the backup uncompleted, and the slave would never come out of recovery mode. On the other hand, maybe you really did run psql, because otherwise I don't see how the slave would have logged such success messages as:

Log: consistent recovery state reached at 0/BF0000B0

and:

Log: streaming replication successfully connected to primary

Did you try connecting to the slave at this point? What happened?

The "Success. You can now start..." message you mention is generated by initdb, which shouldn't be run as part of setting up a slave; so I think you may be confused about something there. I'm also concerned about these apparently conflicting statements:

The only ways I have restarted Postgres is through the service postgresql-9.1 restart or /etc/init.d/postgresql-9.1 restart commands. After I receive this error, I kill all processes and again try to restart the database...

Did you try to stop the service through the service script? What happened? It might help in understanding the logs if you prefixed lines with more information. We use:

log_line_prefix = '[%m] %p %q<%u %d %r> '

The recovery.conf script looks odd. Are you copying from the master's pg_xlog directory, the slave's active pg_xlog directory, or an archive directory?

Best Answer

Related Solutions

PostgreSQL streaming replication broken: invalid magic number, out-of-sequence timeline ID

PostgreSQL 9.1 Hot Backup Error: the database system is starting up

Related Question