PostgreSQL – Resolving WAL Sender Replication Timeout During pg_basebackup

postgresql replication

Let me start with the caveat that I am still green with Postgres.

I am working on a postgres 9.2 Active/Standby cluster on Debian wheezy for an application, based off of the ClusterLabs pgsql cluster documentation.

In the lab I am able to get this working without a problem. But on the production cluster I'm building, I keep running into a problem.

I brought the database files over from the current single production postgres server. By this I mean I shut down postgres, tar-ed up the data directory and copied it over to the cluster's Master node. I put the files in place, set the permissions, and was able to start up postgres on the Master via corosync just fine.

In preparing the slave, I used the pg_basebackup tool to bring the database over from the Master and this is where I keep having issues. As it is transferring, at about 57% I see the error:

$ pg_basebackup -h db-master -U u_repl -D /db/data/postgresql/9.2/main/ -X stream -P
pg_basebackup: could not receive data from WAL stream: SSL connection has been closed unexpectedly
176472/176472 kB (100%), 1/1 tablespace
pg_basebackup: child process exited with error 1

And on the server, I see:

2016-04-06 21:05:31 UTC LOG:  terminating walsender process due to replication timeout

But the transfer doesn't stop and keeps going to completion.

I found this question here on stackexchange about setting "ssl_renegotiation_limit" to 0, but this didn't make much difference.

Anyone have any ideas? I am completely baffled as to why this would error, but keep on going just fine. It is the same procedure I used in the lab setup… the only difference is that the production database is much bigger in size.

Thoughts??
Thank you kindly! -Peter.

Best Answer

Many thanks to Albe Laurenz from the pgsql-admin mailing list.

The server error message means that the client did not send a status update within wal_sender_timeout milliseconds, see documentation.

The base backup needs to complete within this wal_sender_timeout period; otherwise the server terminates the WAL streaming connection. So the timeout has to be set longer than the time the base backup takes.

Side note: I am running 9.2, where this parameter is still called replication_timeout. It was renamed to wal_sender_timeout in 9.3.
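
As a rough sketch of what that change could look like on the Master, assuming a 9.2 server: the 300s value below is purely illustrative and not taken from my setup, so pick something comfortably longer than your own base backup takes.

```
# postgresql.conf on the Master (PostgreSQL 9.2)
# Default is 60s; a value of 0 disables the timeout entirely.
# 300s is an illustrative value, not a recommendation.
replication_timeout = 300s

# On 9.3 and later the same setting is named wal_sender_timeout:
# wal_sender_timeout = 300s
```

The setting only needs a configuration reload, not a restart (for example SELECT pg_reload_conf(); as a superuser), and SHOW replication_timeout; confirms the value that is in effect.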
