I have the following problem: a "vertical" Linux distribution (Sophos UTM) comes with PostgreSQL 9.2 to store its configuration. Unfortunately, since the last update, it seems that the transaction logs (WAL) of some instances are growing without ever being flushed, which causes the pg_xlog folder to grow to be several orders of magnitude larger than the base folder.
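For reference, a rough way to quantify that growth from inside the database is a minimal sketch like the following (it assumes superuser access, since pg_ls_dir is superuser-only in 9.2, and that WAL segments are the default 16 MB):

    -- Count WAL segment files in pg_xlog and estimate their total size
    SELECT count(*)      AS wal_segments,
           count(*) * 16 AS approx_size_mb
    FROM pg_ls_dir('pg_xlog') AS f
    WHERE f ~ '^[0-9A-F]{24}$';   -- match only the 24-hex-character segment names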
I'm now in a delicate situation: due to the excessive growth of the WAL files, the disk of one of these machines (a VM) will get full before Monday. I have already opened a support case with the vendor but, so far, they aren't being very helpful (they suggest we rebuild the VM with larger disks).
This database is never backed up because the software performs backups in a different way (it has its own backup procedure and sends backup files by email), and I suppose that this is the reason why the WAL files are growing so much.
I'm afraid I'm far from being a PostgreSQL expert, so it's very likely I'm asking a silly or obvious question, but: what is the procedure for flushing the WAL files?
Ideally, I'm looking for a procedure that will allow me to flush these WAL files on the problematic system in order to buy myself enough time to get the vendor to issue a better fix.
Edit:
As requested, here is the output of the SELECT version(); query:
PostgreSQL 9.2.4 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 32-bit
(1 row)
And the output of the following query:

SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');
name | current_setting | source
---------------------------------+-----------------+----------------------
hot_standby | on | configuration file
listen_addresses | * | configuration file
log_destination | syslog | configuration file
log_min_duration_statement | -1 | configuration file
log_min_error_statement | error | configuration file
log_min_messages | notice | configuration file
maintenance_work_mem | 512MB | configuration file
max_connections | 300 | configuration file
max_files_per_process | 1000 | configuration file
max_prepared_transactions | 0 | configuration file
max_stack_depth | 2MB | configuration file
max_standby_streaming_delay | 10s | configuration file
max_wal_senders | 10 | configuration file
password_encryption | on | configuration file
pg_stat_statements.max | 1000 | configuration file
pg_stat_statements.save | on | configuration file
pg_stat_statements.track | all | configuration file
pg_stat_statements.track_utility | off | configuration file
port | 5432 | configuration file
random_page_cost | 2 | configuration file
replication_timeout | 1min | configuration file
seq_page_cost | 1 | configuration file
shared_buffers | 512MB | configuration file
shared_preload_libraries | pg_stat_statements | configuration file
ssl | off | configuration file
stats_temp_directory | pg_stat_tmp | configuration file
superuser_reserved_connections | 20 | configuration file
synchronous_commit | local | configuration file
syslog_facility | local0 | configuration file
syslog_ident | postgres | configuration file
temp_buffers | 256MB | configuration file
temp_file_limit | -1 | configuration file
TimeZone | GMT | configuration file
timezone_abbreviations | AlmostAll | configuration file
track_activities | on | configuration file
track_activity_query_size | 4096 | configuration file
track_counts | on | configuration file
track_functions | none | configuration file
track_io_timing | on | configuration file
unix_socket_directory | /var/run/postgresql | configuration file
unix_socket_group | postgres | configuration file
unix_socket_permissions | 0777 | configuration file
update_process_title | on | configuration file
vacuum_defer_cleanup_age | 0 | configuration file
wal_buffers | 16MB | configuration file
wal_keep_segments | 100 | configuration file
wal_level | hot_standby | configuration file
wal_receiver_status_interval | 5s | configuration file
work_mem | 512MB | configuration file
(69 rows)
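Note that checkpoint_segments and checkpoint_timeout do not appear in the list above, which suggests they are still at their defaults; if needed, they can be checked explicitly with something like this sketch:

    -- Show the checkpoint/WAL-retention settings regardless of their source
    SELECT name, setting, unit, source
    FROM pg_settings
    WHERE name IN ('checkpoint_segments', 'checkpoint_timeout',
                   'checkpoint_completion_target', 'wal_keep_segments');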
Edit 2:
We finally reinstalled the whole server (as requested by Sophos support) but using the previous version and a larger disk. Apparently, the older version is using far less space for the WAL than the new one.
Out of curiosity, I ran the same checks for the version and the non-default PostgreSQL parameters, and I got quite different results:
PostgreSQL 8.4.14 on i686-pc-linux-gnu, compiled by GCC gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 32-bit
and
name | current_setting | source
---------------------------------+-----------------+----------------------
autovacuum_analyze_scale_factor | 0.0005 | configuration file
checkpoint_segments | 12 | configuration file
checkpoint_warning | 0 | configuration file
escape_string_warning | off | configuration file
fsync | on | configuration file
listen_addresses | * | configuration file
log_destination | syslog | configuration file
log_timezone | Europe/Zurich | command line
maintenance_work_mem | 1GB | configuration file
max_connections | 300 | configuration file
max_stack_depth | 2MB | environment variable
port | 5432 | configuration file
shared_buffers | 32MB | configuration file
standard_conforming_strings | off | configuration file
syslog_facility | local0 | configuration file
syslog_ident | postgres | configuration file
temp_buffers | 1024 | configuration file
TimeZone | UTC | configuration file
timezone_abbreviations | AlmostAll | configuration file
work_mem | 512MB | configuration file
(20 rows)
It looks to me like there were quite a lot of changes between these two versions.
Best Answer
Most likely what you're seeing is a huge checkpoint_segments value and a long checkpoint_timeout; alternatively, they might have set wal_keep_segments to a very large value if it's supposed to support streaming replication.
You can force a checkpoint with the CHECKPOINT command. This may stall the database for some time if it has accumulated a huge amount of WAL and hasn't been background-writing it. If checkpoint_completion_target is low (less than 0.8 or 0.9), there's likely to be a big backlog of work to do at checkpoint time. Be prepared for the database to become slow and unresponsive during the checkpoint. You cannot abort a checkpoint once it begins by normal means; you can crash the database and restart it, but that just puts you back where you were.
I'm not certain, but I have the feeling a checkpoint could also result in growth of the main database, and do so before any space is freed in the WAL, if it is freed at all. So a checkpoint could potentially cause you to run out of space, something that's very hard to recover from without adding more storage, at least temporarily.
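In practice that could look something like this minimal sketch (run in psql as a superuser; pg_current_xlog_insert_location() is the 9.2 name of the WAL-position function):

    -- Note the current WAL insert location, then force an immediate checkpoint.
    SELECT pg_current_xlog_insert_location();
    CHECKPOINT;   -- may run for a long time and generate heavy I/O
    -- Old segments in pg_xlog are only recycled or removed once the checkpoint completes.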
Now would be a very good time to get a proper backup of the database: use pg_dump -Fc dbname to dump each database, and pg_dumpall --globals-only to dump user definitions etc.
If you can afford the downtime, stop the database and take a file-system-level copy of the entire data directory (the folder containing pg_xlog, pg_clog, global, base, etc.). Do not do this while the server is running, and do not omit any files or folders; they are all important (well, except pg_log, but it's a good idea to keep the text logs anyway).
If you'd like a more specific comment on the likely cause (and so I can be more confident in my hypothesis), you can run the following queries and paste their output into your question (in a code-indented block), then comment so I'm notified:
It is possible that setting checkpoint_completion_target = 1 and then stopping and restarting the DB might cause it to start aggressively writing out queued-up WAL. It won't free any until it does a checkpoint, but you could force one once write activity slows down (as measured with sar, iostat, etc.). I have not tested whether checkpoint_completion_target affects already-written WAL when changed with a restart; consider testing this on a throwaway test PostgreSQL you initdb on another machine first.
Backups have nothing to do with WAL retention and growth; it isn't backup-related.
See:
checkpoint_completion_target