I've setup streaming replication with Postgres 9.3
My problem is that on the slave server the pg_xlog
folder just gets fuller and fuller and WAL files are not getting recycled.
The slave server has the following (relevant) values in postgresql.conf
:
wal_keep_segments = 150
hot_standby = on
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = off
#archive_command = ''
My initial replication command was:
pg_basebackup --xlog-method=stream -h <master-ip> -D . --username=replication --password
So I guess my WAL files are OK.
Here is my slave server startup log:
2017-05-08 09:55:31 IDT LOG: database system was shut down in recovery at 2017-05-08 09:55:19 IDT
2017-05-08 09:55:31 IDT LOG: entering standby mode
2017-05-08 09:55:31 IDT LOG: redo starts at 361/C76DD3E8
2017-05-08 09:55:31 IDT LOG: consistent recovery state reached at 361/C89A8278
2017-05-08 09:55:31 IDT LOG: database system is ready to accept read only connections
2017-05-08 09:55:31 IDT LOG: record with zero length at 361/C89A8278
2017-05-08 09:55:31 IDT LOG: started streaming WAL from primary at 361/C8000000 on timeline 1
2017-05-08 09:55:32 IDT LOG: incomplete startup packet
2017-05-08 09:58:34 IDT LOG: received SIGHUP, reloading configuration files
2017-05-08 09:58:34 IDT LOG: parameter "checkpoint_completion_target" changed to "0.9"
I even tried to copy older WAL files from master server manually to slave but that also didn't help.
What am I doing wrong? How can I stop the pg_xlog
folder from growing indefinitely?
Is it related to the "incomplete startup packet" log message?
One last thing: under the pg_xlog\archive_status
folder all the WAL files are with .done
suffix.
Appreciate any help I can get on this.
Edit:
I enabled log_checkpoints
in postgresql.conf
.
Here are the relevant log entries since I enabled it:
2017-05-12 08:43:11 IDT LOG: parameter "log_checkpoints" changed to "on"
2017-05-12 08:43:24 IDT LOG: checkpoint complete: wrote 2128 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=189.240 s, sync=0.167 s, total=189.549 s; sync files=745, longest=0.010 s, average=0.000 s
2017-05-12 08:45:15 IDT LOG: checkpoint starting: time
2017-05-12 08:48:46 IDT LOG: checkpoint complete: wrote 15175 buffers (6.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.078 s, sync=1.454 s, total=210.617 s; sync files=769, longest=0.032 s, average=0.001 s
2017-05-12 08:50:15 IDT LOG: checkpoint starting: time
2017-05-12 08:53:45 IDT LOG: checkpoint complete: wrote 2480 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.162 s, sync=0.991 s, total=210.253 s; sync files=663, longest=0.076 s, average=0.001 s
Best Answer
The problem seems to have been resolved.
Apparently I had hardware issues on the master server.
I was able to perform full pg_dump and re-index my DB so I was pretty sure I did not have any data integrity issues.
But when looking at the master server logs after I've enabled log_checkpoints in the config - a few minutes before the slave server stopped performing checkpoints I saw the following message:
After seeing that - I decided to switch hosting provider and moved my DB to a new server. Since then (almost a week now) - everything has been running smoothly replication and checkpoints are running as expected.
I really hope this will help other people - but when something like this is happening - always be advised that this issue might be caused by data integrity/hardware issues.