Postgresql – pg_xlog not recycling WAL files on slave server

postgresqlpostgresql-9.3replication

I've setup streaming replication with Postgres 9.3
My problem is that on the slave server the pg_xlog folder just gets fuller and fuller and WAL files are not getting recycled.

The slave server has the following (relevant) values in postgresql.conf:

wal_keep_segments = 150
hot_standby = on
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = off
#archive_command = ''

My initial replication command was:

pg_basebackup  --xlog-method=stream -h <master-ip> -D .  --username=replication --password

So I guess my WAL files are OK.

Here is my slave server startup log:

2017-05-08 09:55:31 IDT LOG:  database system was shut down in recovery at 2017-05-08 09:55:19 IDT
2017-05-08 09:55:31 IDT LOG:  entering standby mode
2017-05-08 09:55:31 IDT LOG:  redo starts at 361/C76DD3E8
2017-05-08 09:55:31 IDT LOG:  consistent recovery state reached at 361/C89A8278
2017-05-08 09:55:31 IDT LOG:  database system is ready to accept read only connections
2017-05-08 09:55:31 IDT LOG:  record with zero length at 361/C89A8278
2017-05-08 09:55:31 IDT LOG:  started streaming WAL from primary at 361/C8000000 on timeline 1
2017-05-08 09:55:32 IDT LOG:  incomplete startup packet
2017-05-08 09:58:34 IDT LOG:  received SIGHUP, reloading configuration files
2017-05-08 09:58:34 IDT LOG:  parameter "checkpoint_completion_target" changed to "0.9"

I even tried to copy older WAL files from master server manually to slave but that also didn't help.

What am I doing wrong? How can I stop the pg_xlog folder from growing indefinitely?
Is it related to the "incomplete startup packet" log message?

One last thing: under the pg_xlog\archive_status folder all the WAL files are with .done suffix.

Appreciate any help I can get on this.

Edit:

I enabled log_checkpoints in postgresql.conf.

Here are the relevant log entries since I enabled it:

2017-05-12 08:43:11 IDT LOG:  parameter "log_checkpoints" changed to "on"
2017-05-12 08:43:24 IDT LOG:  checkpoint complete: wrote 2128 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=189.240 s, sync=0.167 s, total=189.549 s; sync files=745, longest=0.010 s, average=0.000 s
2017-05-12 08:45:15 IDT LOG:  checkpoint starting: time
2017-05-12 08:48:46 IDT LOG:  checkpoint complete: wrote 15175 buffers (6.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.078 s, sync=1.454 s, total=210.617 s; sync files=769, longest=0.032 s, average=0.001 s
2017-05-12 08:50:15 IDT LOG:  checkpoint starting: time
2017-05-12 08:53:45 IDT LOG:  checkpoint complete: wrote 2480 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.162 s, sync=0.991 s, total=210.253 s; sync files=663, longest=0.076 s, average=0.001 s

Best Answer

The problem seems to have been resolved.

Apparently I had hardware issues on the master server.
I was able to perform full pg_dump and re-index my DB so I was pretty sure I did not have any data integrity issues.

But when looking at the master server logs after I've enabled log_checkpoints in the config - a few minutes before the slave server stopped performing checkpoints I saw the following message:

IDT ERROR:  failed to re-find parent key in index "<table_name>_id_udx" for split pages 17/18

After seeing that - I decided to switch hosting provider and moved my DB to a new server. Since then (almost a week now) - everything has been running smoothly replication and checkpoints are running as expected.

I really hope this will help other people - but when something like this is happening - always be advised that this issue might be caused by data integrity/hardware issues.