PostgreSQL streaming replication broken: invalid magic number, out-of-sequence timeline ID

postgresql, replication

For some reason, my PostgreSQL slave no longer streaming-replicates changes from the master. It was working for a while, but I recently noticed that the slave's database contents are stale, and there are errors in the slave's log files, namely "invalid magic number 0000" and "out-of-sequence timeline ID 1 (after 2)".

Do you have any idea how I can troubleshoot this, or why it happens?

Details follow.

Log file error messages: (on the slave, pg_log/postgresql-Sat.log)

LOG:  entering standby mode
LOG:  redo starts at 0/1C78848
LOG:  consistent recovery state reached at 0/1C7FA30
LOG:  database system is ready to accept read only connections
LOG:  invalid magic number 0000 in log file 0, segment 1, offset 14942208
LOG:  streaming replication successfully connected to primary
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
FATAL:  terminating walreceiver process due to administrator command
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
LOG:  out-of-sequence timeline ID 1 (after 2) in log file 0, segment 1, offset 0
...

Transaction logs: (on the slave — weird, log file 0…2…1 is from January 2012, but 0…2…2 is from December 2011!?)

-bash-4.1$ tree -D pg_xlog/
pg_xlog/
├── [Dec  6  6:27]  000000010000000000000001
├── [Jan 20 21:33]  000000020000000000000001
├── [Dec 21  5:29]  000000020000000000000002
├── [Dec  6  6:30]  00000002.history
└── [Dec  6  6:30]  archive_status
    └── [Dec  6  6:30]  00000002.history.ready

1 directory, 5 files
-bash-4.1$ cat pg_xlog/00000002.history 
1       000000010000000000000001        no recovery target specified

Transaction logs: (on the master)

-bash-4.1$ tree -D pg_xlog/
pg_xlog/
├── [Dec  6  6:27]  000000010000000000000001
├── [Jan 21  8:55]  000000020000000000000001
├── [Dec 21  5:31]  000000020000000000000002
├── [Dec  6  6:30]  00000002.history
└── [Dec  6  6:30]  archive_status
    └── [Dec  6  6:30]  00000002.history.ready

-bash-4.1$ cat pg_xlog/00000002.history 
1       000000010000000000000001        no recovery target specified

Config files: (on the slave)

---postgresql.conf---
wal_level = hot_standby
...
hot_standby = on
...

---recovery.conf---
standby_mode = 'on'
primary_conninfo = 'host=dw0azewdbpv11danny user=replicator password=...'
recovery_target_timeline = 'latest'

Finally, on the slave, ps aux shows:

Jan20 0:01 postgres: startup process waiting for 000000020000000000000001
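
I suppose one way to narrow this down would be to compare the timeline each server thinks it is on, using pg_controldata on both the master and the slave and looking at the "Latest checkpoint's TimeLineID" line (this assumes pg_controldata is on the PATH and PGDATA points at the data directory; adjust paths as needed):

-bash-4.1$ pg_controldata $PGDATA | grep -i timeline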

Update one week later:

There seems to be a replication bug for PostgreSQL builds with gcc 4.6.0.
http://archives.postgresql.org/pgsql-general/2011-07/msg00686.php
and
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45d792f70272ed57b932816562f31c2f79426c2a
However, I'd guess that bug would not affect 9.1?

The PostgreSQL versions should be identical on the master and the slave, and SELECT version() looks like this on both:

-bash-4.1$ psql
psql (9.1.1)
Type "help" for help.

postgres=# SELECT version();
                                                   version                                                    
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.1.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6), 64-bit
(1 row)

Perhaps I should also mention that the current master was originally a slave, and the current slave was originally the master: I did one failover to test that it worked okay.

(I'll ask this question on the Postgres mailing lists.)

Best Answer

This doesn't sound like the same thing: your GCC version is different, and so is your PostgreSQL version. IIRC that was an issue with GCC 4.6 optimizations and has since been fixed; I don't believe it would have affected your setup.

More likely, something changed that caused this. Perhaps the slave was briefly promoted to master and then demoted again? That reliably creates problems like this.
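
A quick way to check is to compare the timeline history files on the two servers: every promotion writes a new NNNNNNNN.history file into pg_xlog, so a history file on the slave that the master does not have, or one with an unexpected timestamp, is a giveaway. Something along these lines on both machines (paths as in your listings above):

-bash-4.1$ ls -l pg_xlog/*.history
-bash-4.1$ cat pg_xlog/*.history

Each history file has one line per timeline switch (parent timeline, switch point, reason), and the files' timestamps tell you roughly when each promotion happened.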

The timeline error suggests to me that this is in fact what happened. It doesn't matter whether you write to the database or not; just completing recovery and promoting is enough to break the timeline.
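
If that is what happened, the usual fix on 9.1 is to rebuild the standby from a fresh base backup of the current master; 9.1 has no tool to rewind a diverged standby back onto the master's timeline. A rough sketch with pg_basebackup (host and user taken from your recovery.conf; the data directory path is an assumption, adjust to your layout; it will prompt for the replicator password unless you have a .pgpass entry):

# on the slave: stop the standby and move the old data directory aside
-bash-4.1$ pg_ctl -D /var/lib/pgsql/9.1/data stop -m fast
-bash-4.1$ mv /var/lib/pgsql/9.1/data /var/lib/pgsql/9.1/data.broken
-bash-4.1$ pg_basebackup -h dw0azewdbpv11danny -U replicator -D /var/lib/pgsql/9.1/data -P
# restore your recovery.conf (standby_mode, primary_conninfo, recovery_target_timeline) into the new data directory, then:
-bash-4.1$ pg_ctl -D /var/lib/pgsql/9.1/data start

Once the standby has caught up, the "out-of-sequence timeline ID" messages should disappear, because its history now starts from the master's current timeline.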