PostgreSQL – How to request a flush of the PostgreSQL transaction logs

postgresql

I have the following problem: a "vertical" Linux distribution (Sophos UTM) comes with PostgreSQL 9.2 to store its configuration. Unfortunately, since the last update, it seems that the transaction logs (WAL) of some instances are growing without ever being flushed, which causes the pg_xlog folder to grow to several orders of magnitude larger than the base folder.
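
Something like the following is enough to compare the two folders (the data directory path here is a guess and may well differ on this appliance):

du -sh /var/lib/pgsql/data/pg_xlog /var/lib/pgsql/data/base
df -h /var/lib/pgsql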

I'm now in a delicate situation: due to the excessive growth of the WAL files, the disk of one of these machines (a VM) will get full before Monday. I have already opened a support case with the vendor but, so far, they aren't being very helpful (they suggest we rebuild the VM with larger disks).

This database is never backed up through PostgreSQL itself, because the software has its own backup procedure (it sends backup files by email), and I suppose this is why the WAL files are growing so much.

I'm afraid I'm far from being a PostgreSQL expert, so it's very likely I'm asking a silly or obvious question, but: what is the procedure for requesting that the WAL files be flushed?

Ideally, I'm looking for a procedure that will allow me to flush these WAL files on the problematic system in order to buy myself enough time to get the vendor to issue a better fix.

Edit:
As requested, here is the output of the SELECT version(); query:

 PostgreSQL 9.2.4 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 32-bit

(1 row)

And here is the output of

SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');

 hot_standby                      | on                  | configuration file
 listen_addresses                 | *                   | configuration file
 log_destination                  | syslog              | configuration file
 log_min_duration_statement       | -1                  | configuration file
 log_min_error_statement          | error               | configuration file
 log_min_messages                 | notice              | configuration file
 maintenance_work_mem             | 512MB               | configuration file
 max_connections                  | 300                 | configuration file
 max_files_per_process            | 1000                | configuration file
 max_prepared_transactions        | 0                   | configuration file
 max_stack_depth                  | 2MB                 | configuration file
 max_standby_streaming_delay      | 10s                 | configuration file
 max_wal_senders                  | 10                  | configuration file
 password_encryption              | on                  | configuration file
 pg_stat_statements.max           | 1000                | configuration file
 pg_stat_statements.save          | on                  | configuration file
 pg_stat_statements.track         | all                 | configuration file
 pg_stat_statements.track_utility | off                 | configuration file
 port                             | 5432                | configuration file
 random_page_cost                 | 2                   | configuration file
 replication_timeout              | 1min                | configuration file
 seq_page_cost                    | 1                   | configuration file
 shared_buffers                   | 512MB               | configuration file
 shared_preload_libraries         | pg_stat_statements  | configuration file
 ssl                              | off                 | configuration file
 stats_temp_directory             | pg_stat_tmp         | configuration file
 superuser_reserved_connections   | 20                  | configuration file
 synchronous_commit               | local               | configuration file
 syslog_facility                  | local0              | configuration file
 syslog_ident                     | postgres            | configuration file
 temp_buffers                     | 256MB               | configuration file
 temp_file_limit                  | -1                  | configuration file
 TimeZone                         | GMT                 | configuration file
 timezone_abbreviations           | AlmostAll           | configuration file
 track_activities                 | on                  | configuration file
 track_activity_query_size        | 4096                | configuration file
 track_counts                     | on                  | configuration file
 track_functions                  | none                | configuration file
 track_io_timing                  | on                  | configuration file
 unix_socket_directory            | /var/run/postgresql | configuration file
 unix_socket_group                | postgres            | configuration file
 unix_socket_permissions          | 0777                | configuration file
 update_process_title             | on                  | configuration file
 vacuum_defer_cleanup_age         | 0                   | configuration file
 wal_buffers                      | 16MB                | configuration file
 wal_keep_segments                | 100                 | configuration file
 wal_level                        | hot_standby         | configuration file
 wal_receiver_status_interval     | 5s                  | configuration file
 work_mem                         | 512MB               | configuration file
(69 rows)

Edit 2:

We finally reinstalled the whole server (as requested by Sophos support) but using the previous version and a larger disk. Apparently, the older version is using far less space for the WAL than the new one.

Out of curiosity, I ran the same version and non-default-parameter queries and got quite different results:

PostgreSQL 8.4.14 on i686-pc-linux-gnu, compiled by GCC gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 32-bit

and

              name               | current_setting |        source
---------------------------------+-----------------+----------------------
 autovacuum_analyze_scale_factor | 0.0005          | configuration file
 checkpoint_segments             | 12              | configuration file
 checkpoint_warning              | 0               | configuration file
 escape_string_warning           | off             | configuration file
 fsync                           | on              | configuration file
 listen_addresses                | *               | configuration file
 log_destination                 | syslog          | configuration file
 log_timezone                    | Europe/Zurich   | command line
 maintenance_work_mem            | 1GB             | configuration file
 max_connections                 | 300             | configuration file
 max_stack_depth                 | 2MB             | environment variable
 port                            | 5432            | configuration file
 shared_buffers                  | 32MB            | configuration file
 standard_conforming_strings     | off             | configuration file
 syslog_facility                 | local0          | configuration file
 syslog_ident                    | postgres        | configuration file
 temp_buffers                    | 1024            | configuration file
 TimeZone                        | UTC             | configuration file
 timezone_abbreviations          | AlmostAll       | configuration file
 work_mem                        | 512MB           | configuration file
(20 rows)

It looks to me like quite a lot changed between these two versions.

Best Answer

Most likely what you're seeing is a huge checkpoint_segments value combined with a long checkpoint_timeout; alternatively, they might have set wal_keep_segments to a very large value if it's supposed to support streaming replication.
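
A minimal way to check those settings, and whether any standby is actually connected and consuming WAL (pg_stat_replication does exist in 9.2), is something like:

SELECT name, current_setting(name)
  FROM pg_settings
  WHERE name IN ('checkpoint_segments', 'checkpoint_timeout',
                 'checkpoint_completion_target', 'wal_keep_segments');

-- any connected streaming-replication standbys show up here
SELECT * FROM pg_stat_replication;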

You can force a checkpoint with the CHECKPOINT command. This may stall the database for some time if it has accumulated a huge amount of WAL and hasn't been background-writing it. If checkpoint_completion_target is low (less than 0.8 or 0.9), there's likely to be a big backlog of work to do at checkpoint time, so be prepared for the database to become slow and unresponsive while the checkpoint runs. Once a checkpoint has begun you cannot abort it by normal means; you can crash the database and restart it, but that just puts you back where you started.
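
It has to be run as a superuser; on an appliance like this that presumably means connecting as the postgres OS user:

-- blocks until the checkpoint finishes
CHECKPOINT;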

I'm not certain, but I have a feeling a checkpoint could also cause the main database to grow, and do so before any space is freed in the WAL, if any is freed at all. So a checkpoint could itself push you into running out of space, something that's very hard to recover from without adding more storage, at least temporarily.
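
It's worth keeping an eye on free space while the checkpoint runs; a rough sketch (the mount point and data directory path are assumptions):

watch -n 10 'df -h /var/lib/pgsql; du -sh /var/lib/pgsql/data/pg_xlog'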

Now would be a very good time to get a proper backup of the database: use pg_dump -Fc dbname to dump each database, and pg_dumpall --globals-only to dump user definitions and other global objects.
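
For example (database name, user and destination are placeholders; write the dumps somewhere other than the nearly-full disk, e.g. an NFS mount or over ssh):

pg_dump -Fc -U postgres mydb > /mnt/backup/mydb.dump
pg_dumpall -U postgres --globals-only > /mnt/backup/globals.sql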

If you can afford the downtime, stop the database and take a file-system-level copy of the entire data directory (the folder containing pg_xlog, pg_clog, global, base, etc.). Do not do this while the server is running, and do not omit any files or folders; they are all important (well, except pg_log, but it's a good idea to keep the text logs anyway).
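
A rough sketch, assuming a SUSE-style init script and the usual data directory location (both may differ on this appliance):

/etc/init.d/postgresql stop
tar -czf /mnt/backup/pgdata-$(date +%Y%m%d).tar.gz -C /var/lib/pgsql data
/etc/init.d/postgresql start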

If you'd like more specific comment on the likely cause (and so I can be more confident in my hypothesis), run the following queries, paste their output into your question (in a code-indented block), then comment so I'm notified:

SELECT version();

SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');

Setting checkpoint_completion_target = 1 and then stopping and restarting the DB might cause it to start aggressively writing out queued-up WAL. It won't free any until it does a checkpoint, but you could force one once write activity slows down (as measured with sar, iostat, etc.). I have not tested whether checkpoint_completion_target affects already-written WAL when changed across a restart; consider testing this first on a throwaway PostgreSQL instance you initdb on another machine.
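
Sketched out, assuming the stock config file location and init script name:

# in postgresql.conf:  checkpoint_completion_target = 1
/etc/init.d/postgresql restart
iostat -x 10                        # wait until write activity tails off
psql -U postgres -c "CHECKPOINT;"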

Backups have nothing to do with WAL retention and growth; the lack of conventional backups is not the cause here.

See: