PostgreSQL – Docker Incorrect Resource Manager Data Checksum in Record

postgresql replication

I'm running PostgreSQL 9.3.9, Docker 1.8.2 and Ubuntu 14.04. I have an issue where my hot standby keeps failing with the following error message:

incorrect resource manager data checksum in record at 46F/6A7B6D28

After turning on DEBUG2 (log_min_messages = debug2), this is the log I now see:

2015-10-02 06:56:34.033 UTCDEBUG:  sending write 477/1E9C9990 flush 477/1E9C6700 apply 477/1E9C6700
2015-10-02 06:56:34.078 UTCDEBUG:  sending write 477/1E9C9990 flush 477/1E9C9990 apply 477/1E9C6700
2015-10-02 06:56:34.078 UTCDEBUG:  sendtime 2015-10-02 06:56:34.027356+00 receipttime 2015-10-02 06:56:34.078378+00 replication apply delay 0 ms transfer latency 51 ms
2015-10-02 06:56:34.078 UTCLOG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:34.078 UTCDEBUG:  sending write 477/1E9CE560 flush 477/1E9C9990 apply 477/1E9C8488
2015-10-02 06:56:34.095 UTCDEBUG:  sending write 477/1E9CE560 flush 477/1E9CE560 apply 477/1E9C8488
2015-10-02 06:56:34.095 UTCFATAL:  terminating walreceiver process due to administrator command
2015-10-02 06:56:34.195 UTCDEBUG:  switched WAL source from stream to archive after failure
2015-10-02 06:56:34.195 UTCLOG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:34.195 UTCDEBUG:  switched WAL source from archive to stream after failure
2015-10-02 06:56:34.196 UTCLOG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:39.200 UTCDEBUG:  switched WAL source from stream to archive after failure
2015-10-02 06:56:39.200 UTCDEBUG:  incorrect resource manager data checksum in record at 477/1E9C8488
2015-10-02 06:56:39.200 UTCDEBUG:  switched WAL source from archive to stream after failure

Or this log upon failure:

2015-10-02 00:55:19.838 UTCDEBUG:  sendtime 2015-10-02 00:55:19.836191+00 receipttime 2015-10-02 00:55:19.838961+00 replication apply delay 0 ms transfer latency 2 ms
2015-10-02 00:55:19.839 UTCDEBUG:  sending write 476/E48058A0 flush 476/E4805798 apply 476/E4805798
2015-10-02 00:55:19.878 UTCDEBUG:  attempting to remove WAL segments older than log file 0000000000000472000000FB
2015-10-02 00:55:19.878 UTCDEBUG:  sending write 476/E48058A0 flush 476/E48058A0 apply 476/E4805798
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/offsets/0000
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/offsets/0002
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/offsets/0001
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/members/0000
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/members/0004
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/members/0002
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/members/0001
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/members/0003
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/members/0005
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/offsets/0000
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/offsets/0002
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_multixact/offsets/0001
2015-10-02 00:55:19.879 UTCDEBUG:  SlruScanDirectory invoking callback on pg_subtrans/0D6D
2015-10-02 00:55:19.879 UTCDEBUG:  recovery restart point at 476/E410F7E0
2015-10-02 00:55:19.879 UTCDETAIL:  last completed transaction was at log time 2015-10-02 00:55:19.835728+00
2015-10-02 00:55:20.177 UTCDEBUG:  sendtime 2015-10-02 00:55:20.174799+00 receipttime 2015-10-02 00:55:20.177888+00 replication apply delay 0 ms transfer latency 3 ms
2015-10-02 00:55:20.177 UTCDEBUG:  sending write 476/E4806570 flush 476/E48058A0 apply 476/E48058A0
2015-10-02 00:55:20.211 UTCDEBUG:  sending write 476/E4806570 flush 476/E4806570 apply 476/E48058A0
2015-10-02 00:55:20.212 UTCDEBUG:  sendtime 2015-10-02 00:55:20.181054+00 receipttime 2015-10-02 00:55:20.212069+00 replication apply delay 0 ms transfer latency 31 ms
2015-10-02 00:55:20.212 UTCDEBUG:  sending write 476/E4809808 flush 476/E4806570 apply 476/E4806570
2015-10-02 00:55:20.254 UTCDEBUG:  sending write 476/E4809808 flush 476/E4809808 apply 476/E4806570
2015-10-02 00:55:20.254 UTCDEBUG:  sendtime 2015-10-02 00:55:20.182113+00 receipttime 2015-10-02 00:55:20.254256+00 replication apply delay 0 ms transfer latency 72 ms
2015-10-02 00:55:20.254 UTCDEBUG:  sendtime 2015-10-02 00:55:20.195848+00 receipttime 2015-10-02 00:55:20.254304+00 replication apply delay 0 ms transfer latency 58 ms
2015-10-02 00:55:20.254 UTCDEBUG:  sendtime 2015-10-02 00:55:20.226562+00 receipttime 2015-10-02 00:55:20.254326+00 replication apply delay 0 ms transfer latency 27 ms
2015-10-02 00:55:20.254 UTCDEBUG:  sending write 476/E480F1D0 flush 476/E4809808 apply 476/E4809808
2015-10-02 00:55:20.271 UTCDEBUG:  sending write 476/E480F1D0 flush 476/E480F1D0 apply 476/E4809808
2015-10-02 00:55:20.479 UTCDEBUG:  sendtime 2015-10-02 00:55:20.476873+00 receipttime 2015-10-02 00:55:20.479711+00 replication apply delay 0 ms transfer latency 2 ms
2015-10-02 00:55:20.479 UTCDEBUG:  sending write 476/E480F2D8 flush 476/E480F1D0 apply 476/E480F1D0
2015-10-02 00:55:20.534 UTCDEBUG:  sending write 476/E480F2D8 flush 476/E480F2D8 apply 476/E480F1D0
2015-10-02 00:55:20.534 UTCLOG:  record with zero length at 476/E480F1D0
2015-10-02 00:55:20.534 UTCDEBUG:  sendtime 2015-10-02 00:55:20.477442+00 receipttime 2015-10-02 00:55:20.534532+00 replication apply delay 0 ms transfer latency 57 ms
2015-10-02 00:55:20.534 UTCDEBUG:  sendtime 2015-10-02 00:55:20.485564+00 receipttime 2015-10-02 00:55:20.534891+00 replication apply delay 0 ms transfer latency 49 ms
2015-10-02 00:55:20.534 UTCDEBUG:  sending write 476/E4814EF0 flush 476/E480F2D8 apply 476/E480F1D0
2015-10-02 00:55:20.575 UTCDEBUG:  sending write 476/E4814EF0 flush 476/E4814EF0 apply 476/E480F1D0
2015-10-02 00:55:20.575 UTCFATAL:  terminating walreceiver process due to administrator command
2015-10-02 00:55:20.675 UTCDEBUG:  switched WAL source from stream to archive after failure
2015-10-02 00:55:20.675 UTCLOG:  record with zero length at 476/E480F1D0
2015-10-02 00:55:20.675 UTCDEBUG:  switched WAL source from archive to stream after failure

The first time I got this message, I read something that said I should regenerate the hot standby from the master. So I rsync'd everything back over and restarted. The hot standby connects to the master over a TINC VPN connection; it connected successfully and was streaming changes.
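
For reference, that rsync-based rebuild is essentially the usual base-backup procedure; a rough sketch, with the standby host and the data directory paths as placeholders rather than my real ones:

# on the master: mark the start of the backup, copy the data directory over, then end it
psql -U postgres -c "SELECT pg_start_backup('standby_rebuild', true);"
rsync -a --delete --exclude=postmaster.pid --exclude=pg_xlog \
    /var/lib/postgresql/9.3/main/ STANDBY_HOST:/var/lib/postgresql/9.3/main/
psql -U postgres -c "SELECT pg_stop_backup();"

On 9.3, pg_basebackup -R is the more hands-off alternative, since it also writes the recovery.conf for you.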

It all seemed to be working fine until around 8am, which is when our system starts to get a bit busier. Then all I see in the logs is the above message, and the last WAL file on the hot standby is 000000010000046F0000006A. If I restart the hot standby, it starts catching up with the master, but it also spikes my CPU and brings everything grinding to a halt, and it never seems to catch up anyway.
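
To put a number on "catching up", the standby-side functions in 9.3 show how far replay lags behind what has been received (a large, growing gap between the two locations means the standby is falling behind on apply rather than on the network):

-- run on the hot standby
SELECT pg_last_xlog_receive_location() AS received,
       pg_last_xlog_replay_location()  AS replayed,
       pg_last_xact_replay_timestamp() AS last_replayed_xact;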

I have just migrated the DB and hot standby to two new servers, and I am using the same simple configuration from the previous servers on these new machines. On the previous servers the databases were not dockerized; on these they are. I have other dockerized PostgreSQL installations that run replication fine, but those servers do not have as high a load as this one. I even replicated from the new master to the old, non-dockerized PostgreSQL server that used to be the main server, and the same thing happened there. I had also been using replication between the old master and hot standby just fine over the TINC VPN before the migration, so I don't think Docker or TINC is at fault.

Here is the PostgreSQL configuration for both master and hot standby:

listen_addresses          =  '*'
port                      =  5432

max_connections           =  600
shared_buffers            =  8GB
effective_cache_size      =  16GB
work_mem                  =  10MB
maintenance_work_mem      =  1638MB
max_locks_per_transaction =  256

hot_standby       =  on
archive_mode      =  on

wal_level         =  'hot_standby'
max_wal_senders   =  10
wal_keep_segments =  1000     # 80 GB required on pg_xlog

archive_command   =  'true'
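
As an aside on the config above: archive_command = 'true' simply discards WAL, so when the standby logs "switched WAL source from stream to archive after failure" there is no archive to fall back on, and it can only replay what is still within wal_keep_segments on the master. A docs-style archive setup would look something like the following, with /mnt/wal_archive as a placeholder path:

archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'   # on the master
restore_command = 'cp /mnt/wal_archive/%f %p'                                    # in recovery.conf on the standby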

And the additional recovery.conf on the hot standby:

standby_mode = 'on'
primary_conninfo = 'host=MY_IP port=MY_PORT user=MY_USER password=MY_PASSWORD'
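
On the master side, the streaming connection and how far the standby has acknowledged can be checked with the standard view (available since 9.1):

-- run on the master; one row per connected standby
SELECT client_addr, state, sent_location, write_location,
       flush_location, replay_location
FROM pg_stat_replication;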

Any help would be greatly appreciated! I need my hot standby!

Best Answer

Ok, I finally figured it out (for the most part)!

The important detail I left out of the question was that I was running PostgreSQL on btrfs and streaming to a different server running ext4. There seemed to be some race condition during high-load periods that caused the streamed data to be corrupted or read incorrectly; I don't know exactly what. Sometimes it failed after 30 seconds, sometimes after 30 minutes.
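
In hindsight, one way to check whether the record at the reported LSN was really corrupt on disk would have been pg_xlogdump from contrib (shipped with 9.3); the data directory path here is just an example:

# dump a few records around the failing location straight from the standby's pg_xlog
pg_xlogdump -p /var/lib/postgresql/9.3/main/pg_xlog -s 477/1E9C8488 -n 10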

So last night I shut down the system, backed everything up to a separate HDD, reformatted my btrfs partition as ext4, moved everything back and brought the system back up. Once I restarted the live replication it caught up, and now, 24 hours later, it is still perfectly in sync with no errors!
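
Roughly, the steps were along the lines of the sketch below; the container name, device and mount point are illustrative, not my real ones:

# stop the database so the data directory is quiescent
docker stop pg_container

# copy the data directory off to the spare HDD
rsync -a /var/lib/postgresql/ /mnt/backup_hdd/postgresql/

# reformat the old btrfs partition as ext4 and remount it (update /etc/fstab to match)
umount /var/lib/postgresql
mkfs.ext4 /dev/sdX1
mount /dev/sdX1 /var/lib/postgresql

# copy the data back and bring the database up again
rsync -a /mnt/backup_hdd/postgresql/ /var/lib/postgresql/
docker start pg_container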

So whatever was going on, it was related to the btrfs partition. I've spent this entire week trying to figure it out, so I hope this saves someone else some time. :-)