PostgreSQL corrupted after running pg_resetxlog

Tags: barman, postgresql, postgresql-9.3

We are using PostgreSQL 9.3 on Ubuntu 14.04. This PostgreSQL server is shared among all our application servers (Odoo), so we run it in a separate environment.

On Saturday we found that the disk on this DB server was full. On further investigation, we found that the backup server (Barman) had gone down, so all the archived WAL files stayed on the database server and filled the entire disk. The backup server may have stopped working a month ago.

By googling, we found a suggested solution: reset the pg_xlog files. So we cleared the log files using the pg_resetxlog command. As the forum said, the disk was cleared, and we rebooted the server. But now no databases are found :-(. We listed them using the psql --list command. Nothing has worked so far. We are not able to restore the backup from Barman either. We continued our investigation and found that all the data is still present under the base folder of the main Postgres data path.

The steps we executed to reset the log were as follows.

Stop the database server:

    sudo service postgresql stop

Log in as the postgres user:

    sudo su - postgres

Run the reset command:

    /usr/lib/postgresql/9.3/bin/pg_resetxlog -f /var/lib/postgresql/9.3/main/

Disable the Barman configuration in the postgresql.conf file to stop the backup process for a while.

Reboot the server.

Contents of /var/lib/postgresql/9.3/main/:

postgres@server2:~$ du -h 9.3/main/
12K     9.3/main/pg_notify
28M     9.3/main/base/2735749
73M     9.3/main/base/4172290
46M     9.3/main/base/4410494
81M     9.3/main/base/3002089
43M     9.3/main/base/4282962
47M     9.3/main/base/3377227
130M    9.3/main/base/4098067
44M     9.3/main/base/1682791
58M     9.3/main/base/3377231
4.0K    9.3/main/base/pgsql_tmp
6.1M    9.3/main/base/12030
41M     9.3/main/base/4280118
54M     9.3/main/base/3149391
45M     9.3/main/base/4202614
49M     9.3/main/base/3344071
45M     9.3/main/base/2985056
51M     9.3/main/base/2120822
18G     9.3/main/base/3655712
25M     9.3/main/base/2759574
40M     9.3/main/base/4388978
52M     9.3/main/base/2435773
53M     9.3/main/base/4236740
55M     9.3/main/base/3386464
6.2M    9.3/main/base/12035
201M    9.3/main/base/4112218
54M     9.3/main/base/1625789
635M    9.3/main/base/149656
40M     9.3/main/base/4190162
25M     9.3/main/base/4090019
150M    9.3/main/base/4338686
6.2M    9.3/main/base/1
86M     9.3/main/base/2101485
185M    9.3/main/base/3453985
48M     9.3/main/base/4244883
41M     9.3/main/base/4160039
47M     9.3/main/base/3377180
38M     9.3/main/base/4150310
8.9G    9.3/main/base/2926431
47M     9.3/main/base/1693701
28M     9.3/main/base/4153341
25M     9.3/main/base/2744130
74M     9.3/main/base/2023404
29M     9.3/main/base/3231291
28M     9.3/main/base/2749185
43M     9.3/main/base/4371923
47M     9.3/main/base/3410953
47M     9.3/main/base/4313961
50M     9.3/main/base/4399246
49M     9.3/main/base/3402258
84M     9.3/main/base/3379836
64M     9.3/main/base/2777796
30G     9.3/main/base
5.8M    9.3/main/global
88K     9.3/main/pg_multixact/offsets
256K    9.3/main/pg_multixact/members
348K    9.3/main/pg_multixact
4.0K    9.3/main/pg_xlog/archive_status
33M     9.3/main/pg_xlog
100K    9.3/main/pg_stat_tmp
4.0K    9.3/main/pg_serial
4.0M    9.3/main/pg_clog
4.0K    9.3/main/pg_stat
52K     9.3/main/pg_subtrans
4.0K    9.3/main/pg_tblspc
4.0K    9.3/main/pg_twophase
4.0K    9.3/main/pg_snapshots
30G     9.3/main/

Best Answer

I guess it is too late now. If you use pg_resetxlog, you need to be extremely careful and, most importantly, know what you are doing.

Thanks to PostgreSQL's robustness, all you had to do in that case was free up space on the Barman server, for example by deleting the oldest backup in the catalogue. Once space was reclaimed, PostgreSQL would have resumed shipping WAL files and recovered automatically.

I know it is too late for you, but I am hoping that my reply will be able to help somebody in the future and prevent them from running pg_resetxlog.
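
For anyone in the same situation, here is a minimal sketch of a safer check. It assumes the default Ubuntu data directory from the question and a Barman server entry named "main" (a hypothetical name, use your own). The helper only counts WAL segments that PostgreSQL has flagged as waiting for archiving; it deletes nothing.

```shell
#!/bin/sh
# Count WAL segments still waiting to be archived.
# PostgreSQL marks each such segment with a "<segment>.ready" file
# in pg_xlog/archive_status.
pending_wal_count() {
    # $1 = data directory (defaults to the path used in the question)
    find "${1:-/var/lib/postgresql/9.3/main}/pg_xlog/archive_status" \
        -maxdepth 1 -name '*.ready' | wc -l | tr -d ' '
}

# Recovery outline, left commented out ("main" is an assumed server name):
#   barman check main            # on the Barman host: is archiving healthy?
#   barman list-backup main      # see which backups occupy space
#   barman delete main oldest    # drop the oldest backup to reclaim space
#   pending_wal_count            # on the DB host: should shrink as WAL ships
```

Once archiving works again, the count should fall as PostgreSQL ships the backlog and recycles the segments on its own.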

Related Solutions

Postgresql – pg_upgrade unrecognized configuration parameter “unix_socket_directory”

I hacked the problem by running (as root):

mv /usr/bin/pg_ctl{,-orig}
echo '#!/bin/bash' > /usr/bin/pg_ctl
echo '"$0"-orig "${@/unix_socket_directory/unix_socket_directories}"' >> \
     /usr/bin/pg_ctl
chmod +x /usr/bin/pg_ctl

Run pg_upgrade as intended, then undo the hack:

    mv -f /usr/bin/pg_ctl{-orig,}

The problem is that pg_upgrade executes the program pg_ctl with arguments that specify the old "unix_socket_directory" parameter rather than the new "unix_socket_directories" (note the second is plural). This hack renames the original /usr/bin/pg_ctl to /usr/bin/pg_ctl-orig, and then creates a shell script in its place that simply calls the original pg_ctl program, passing all arguments with any strings "unix_socket_directory" changed to "unix_socket_directories".

In bash, one can change a portion of a string, say from bar to baz in a variable $foo, by using ${foo/bar/baz} (note this does not change the variable, but rather returns the variable's modified contents). Arrays can also be used with ${x/y/z} to retrieve an array with all of its contents replaced, all at once. The variable $@ is an array that contains all arguments passed to the program/script/function, so the new pg_ctl script executes the old one with all arguments changed from the old directory name to the new one.
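
The substitution described above can be tried in isolation. This is a small sketch; the function name rewrite_args and the sample arguments are made up for illustration and only resemble what pg_upgrade might pass to pg_ctl.

```shell
#!/bin/bash
# ${var/pattern/replacement} returns a modified copy; var itself is unchanged.
foo="one bar two"
echo "${foo/bar/baz}"   # prints: one baz two
echo "$foo"             # prints: one bar two

# Applied to "$@", the substitution is made in every argument separately.
rewrite_args() {
    printf '%s\n' "${@/unix_socket_directory/unix_socket_directories}"
}

# Hypothetical arguments; each rewritten argument is printed on its own line.
rewrite_args -w -o "-c unix_socket_directory=/tmp" start
```

Arguments that do not contain the old parameter name pass through untouched, which is why the wrapper is safe for every other pg_ctl invocation.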

PostgreSQL Slave has more files in pg_xlog than /wal_archive

How is it possible for the Slave to have more pg_xlog/ log files than the Master?

The whole point of archiving WAL on the master to some external location is to let the master then delete it to free space in its pg_xlog, while replicas might still need it.

A replica can have more archives in pg_xlog than the master, and older ones, if it's lagging behind the master due to failure to keep up with replay. However, with pg_standby that shouldn't happen - the archive might contain more xlogs, but the replica should only be reading them on-demand.

It's hard to be specific, because you've given a broad description of the issue rather than actual directory listings, and haven't explained the exact steps you followed to set up the replica or shown the exact log output from the replica. So the best I can do is "it sounds like the replica setup is broken somehow".

To resync the servers in warm standby mode, do I have to run pg_basebackup again (to essentially copy the Master's /data and /pg_xlog directories) to the Slave?

Assuming that here /data is the main datadir, containing global, base, pg_clog, etc, and that pg_xlog is the transaction logs from a different disk: Yes, that's right.

You must use the pg_basebackup command, though, or follow the instructions in the manual for correct file system level copies using pg_start_backup() and rsync/cp.

You also have to make sure you've stopped the replica first. Overwriting its datadir while it's running will make it quite upset.
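
As a rough sketch of that order of operations, the helper below prints a resync plan for review and refuses to do so while the replica still looks like it is running. The host master.example.com, the role replicator, and both function names are placeholders, not part of the original setup.

```shell
#!/bin/sh
# A running (or crashed) server leaves postmaster.pid in its data directory,
# so its absence is a cheap check that the replica has been stopped.
replica_is_stopped() {
    [ ! -e "$1/postmaster.pid" ]
}

# Print (do not run) the resync commands.
# $1 = master host, $2 = replica data directory; both are placeholders.
print_resync_plan() {
    if ! replica_is_stopped "$2"; then
        echo "refusing: $2/postmaster.pid exists, stop the replica first" >&2
        return 1
    fi
    # pg_basebackup needs an empty or missing target, so move the old one aside.
    echo "mv $2 $2.old"
    echo "pg_basebackup -h $1 -U replicator -D $2 -X stream -P"
}
```

For example, print_resync_plan master.example.com /var/lib/postgresql/9.3/main lists the two commands to run as the postgres user. The replica's recovery.conf still has to be put back in place before the server is started again.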

Streaming replication vs warm standby

Hot vs warm standby is orthogonal to streaming vs log shipping replication.

What you're trying to do is use log shipping instead of streaming replication. It doesn't matter for this purpose if the replica is a hot standby or a warm standby, i.e. whether or not it's accepting queries.

Personally I recommend using both methods - use streaming, and fall back to log shipping if there's a problem with streaming. PostgreSQL does this automatically if both are configured.
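
A minimal sketch of what "both configured" could look like on a 9.3 standby follows. The helper just prints a recovery.conf; the host name, role, and the /wal_archive path are assumptions to be replaced with your own values.

```shell
#!/bin/sh
# Emit a recovery.conf for a 9.3 standby that streams from the master and
# falls back to the WAL archive. Host, role and archive path are placeholders.
print_recovery_conf() {
    cat <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=master.example.com user=replicator'
restore_command = 'cp /wal_archive/%f %p'
EOF
}
```

With standby_mode on, restore_command should be a plain copy that fails quickly when a file is missing, rather than pg_standby, so that the standby can alternate between the archive and the streaming connection.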
