It sounds like PostgreSQL is set up to recover from log shipping rather than by connecting as a replication user. Please double- and triple-check your recovery.conf, and if that doesn't work, post it here.

The approach you are taking is a valid one, though. It means that recovery will simply wait for the next segment to arrive, producing the message you are seeing; each segment must be shipped by the `archive_command` configured in the master's `postgresql.conf` and picked up by the `restore_command` in the standby's `recovery.conf`.
Most likely what you're seeing is a huge `checkpoint_segments` value and a long `checkpoint_timeout`; alternately, they might have set `wal_keep_segments` to a very large value if it's supposed to support streaming replication.
You can force a checkpoint with the `CHECKPOINT` command. This may stall the database for some time if it has accumulated a huge amount of WAL and hasn't been background-writing it. If `checkpoint_completion_target` is low (less than 0.8 or 0.9) then there's likely to be a big backlog of work to do at checkpoint time. Be prepared for the database to become slow and unresponsive during the checkpoint. You cannot abort a checkpoint once it begins by normal means; you can crash the database and restart it, but that just puts you back where you were.
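For example, before forcing a checkpoint you can look at how the relevant settings are configured. A sketch, run in psql (forcing the checkpoint requires superuser):

```sql
-- Check the settings that govern how much WAL accumulates between checkpoints
SELECT name, setting
FROM pg_settings
WHERE name IN ('checkpoint_segments',
               'checkpoint_timeout',
               'checkpoint_completion_target',
               'wal_keep_segments');

-- Then force an immediate checkpoint; may stall the server as described above
CHECKPOINT;
```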
I'm not certain, but I have the feeling a checkpoint could also result in growth of the main database, and do so before any space is freed in the WAL (if any is freed at all). So a checkpoint could potentially push you into running out of space entirely, something that's very hard to recover from without at least temporarily adding more storage.
Now would be a very good time to get a proper backup of the database: use `pg_dump -Fc dbname` to dump each database, and `pg_dumpall --globals-only` to dump user definitions etc.
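A minimal sketch of those two backup commands as a script. The database name `mydb` and the file-naming scheme are assumptions, and the script only prints the commands so you can review them before running:

```shell
#!/bin/sh
DB=mydb                  # assumed database name; repeat the dump per database
STAMP=$(date +%Y%m%d)

# -Fc writes a compressed, custom-format archive that pg_restore can
# restore selectively (single tables, schema only, etc.)
DUMP_CMD="pg_dump -Fc $DB -f $DB-$STAMP.dump"

# Roles, tablespaces and other cluster-wide objects are NOT included in
# pg_dump output, so dump them separately
GLOBALS_CMD="pg_dumpall --globals-only -f globals-$STAMP.sql"

echo "$DUMP_CMD"
echo "$GLOBALS_CMD"
```

Drop the `echo`s once you've confirmed the commands against your own database names and backup location.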
If you can afford the downtime, stop the database and take a file-system level copy of the entire data directory (the folder containing `pg_xlog`, `pg_clog`, `global`, `base`, etc). Do not do this while the server is running, and do not omit any files or folders; they are all important (well, except `pg_log`, but it's a good idea to keep the text logs anyway).
If you'd like more specific comment on the likely cause (and so I can be more confident in my hypothesis), you can run the following queries and paste their output into your answer (in a code-indented block), then comment so I'm notified:
SELECT version();
SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');
It is possible that setting `checkpoint_completion_target = 1` and then stopping and restarting the DB might cause it to start aggressively writing out the queued-up WAL. It won't free any until it completes a checkpoint, but you could force one once write activity slows down (as measured with sar, iostat, etc). I have not tested whether `checkpoint_completion_target` affects already-written WAL when changed across a restart; consider trying this first on a throwaway PostgreSQL instance you `initdb` on another machine.
Backups have nothing to do with WAL retention and growth; it isn't backup related.
Best Answer
To answer your questions:
Set the `checkpoint_timeout` lower, as you did, to 30 seconds or so. I made sure to enable `log_checkpoints` in my `postgresql.conf` and used check_postgres.pl to monitor the number of segments and the replication lag. You can also set `checkpoint_completion_target = 0` to adjust the checkpointing behavior. But really, on a replica, `checkpoint_timeout` is the only way to keep the `pg_xlog` directory down in size. Additional reference here: pgsql bugs #7801.

Adjusting `checkpoint_timeout` is unfortunately the best option. There are some tips on dealing with `pg_xlog` filling up in this blog post: Solving pg_xlog out of disk space problems, but these mostly apply to a primary that has run out of space, and the panic related to that.

PostgreSQL WAL segments take up approximately this much room, according to the documentation for 9.1 WAL Configuration:

    (2 + checkpoint_completion_target) * checkpoint_segments + 1

or

    checkpoint_segments + wal_keep_segments + 1

files. So if you do the math with, for example, `checkpoint_completion_target = 0.9` and `checkpoint_segments = 50`:

    (2 + 0.9) * 50 + 1 = 146 segments * 16 MB = 2336 MB
Which is only about 45% of your 5 GB `pg_xlog` filesystem. So long as you aren't getting checkpoint warnings in your logs, and the space isn't in danger of being completely used up, 5 GB should be fine. Using these formulas, you should be able to appropriately size your `pg_xlog` filesystem for your usage.

Greg Smith has a lovely chart in PostgreSQL 9.0 High Performance, referenced here: Server Configuration Tuning Practices. I would also recommend the book as a reference; even though it's a bit old, a lot of the information in it is still quite applicable.
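Put together, the replica-side settings discussed above might look like this in `postgresql.conf`. These values are illustrative, not prescriptive; tune them for your workload:

```
# postgresql.conf (replica) -- illustrative values only
checkpoint_timeout = 30s            # checkpoint often so pg_xlog segments are recycled sooner
checkpoint_completion_target = 0    # don't spread checkpoint writes out over time
log_checkpoints = on                # log each checkpoint so you can monitor frequency
```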
Hope that helps. =)