Postgresql configuring wal_archive directory

postgresqlreplicationwrite-ahead-logging

I had a problem with PostgreSQL server.

I took over PostgreSQL server from my former colleague, and today my server crashed due to filled up disk space with wal_archives.

Usually wal segments were put in root directory /wal_archive/, I checked it – its empty. Some time ago I checked it and it had some segments…

I dug deeper and find out that all wal segments were stored in pg_data/pg_xlog directory.

How could this happen? While postgresql.conf is set up like this (showing only enabled segment of 'write ahead log' rows)

wal_level = hot_standby                 # minimal, archive, hot_standby, or 
archive_mode = on               # allows archiving to be done
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'  

Why my wal_archive directory now is empty wwhile config settings showing different information?

The wal_archive directory was owned by root user
pg_hba.conf file does not have any replication user – I assume that there is no replication on the server also. There are no files like recovery.conf anywhere.

cp: cannot stat 'pg_xlog/000000010000033C00000081': No such file or directory
2018-12-05 14:37:54 EET [7540-1860] LOG: archive command failed with exit code 1
2018-12-05 14:37:54 EET [7540-1861] DETAIL: The failed archive command was: test ! -f /wal_archive/000000010000033C00000081 && cp pg_xlog/000000010000033C00000081 /wal_archive/000000010000033C00000081
2018-12-05 14:37:54 EET [7540-1862] WARNING: archiving transaction log file "000000010000033C00000081" failed too many times, will try again later

Also due to panic, I accidentally deleted big chunk of wal segments in xlog directory, to fire up server. Everything is working ok now, but I want to ensure that there will be no problems in the future.

Best Answer

"pg_xlog" is the working directory the WAL files are stored in while PostgreSQL itself still might need the WAL files, i.e. to recovery from a power outage or other "soft" crash. They are removed from there only once PostgreSQL is sure they are no longer needed for that purpose, and once "archive_command" has successfully completed.

We can't tell what happened to your "/wal_archive/", but here is a plausible story:

You had some kind of network filesystem (NFS, CIFS, etc.) mounted using "/wal_archive" as a mount point. That mount failed and never got re-established. That means the original filesystem containing your WAL files is still floating out there somewhere, but the computer no longer know how to access it. You see the former /wal_archive mount point as an empty directory because that is how mounts work on Linux. The user running your postgres server probably can't write to this residual directory, because when the mount was lost so were the permissions. So the archive_command has been failing with permissions errors, but you never noticed that because you weren't looking in the server log. It repeatedly tried to archive the WAL file out of pg_xlog, but repeated failed. Since it failed, it could never remove the file from pg_xlog, until eventually it filled up. In a panic, you removed some of the files from pg_xlog. Now it is still trying to archive that one file, but it is now failing for a different reason, the file is no longer in pg_xlog, as demonstrated by the error message "cp: cannot stat 'pg_xlog/000000010000033C00000081'".

Now that you have it working again, either files should be showing up in the /wal_archive/, or there should be errors in the server log file telling you that the archive_command failed.

Note that the archive_command you are using is not perfectly safe. It does not fsync the file contents just copied into /wal_archive/ before it reports success, so a poorly-timed crash could mean that the just-archived file doesn't actually exist anywhere. You might want to switch to a canned solutions like "pgbackrest" or "barman", or using streaming replication, to avoid that kind of pitfall.

Also note that any backups you think you have are likely to be invalid, unless you are also taking logical backups (like with pg_dump). You better take new ones and test them.