Postgresql Vacuum and replication

postgresqlreplicationvacuum

I have an architecture of cascading replication with 3 postgresql db

enter image description here

Because the master is becoming too big (~ 170Gig), I wanted to run a "cleaning" job over the week-end that would do multiple DELETE operations on Millions of rows by batch and VACUUM table right after.
Unfortunately my cleaning script could not complete because disk of DB2 got filled (pg_xlogs?)

2019-06-21 17:41:08.770 UTC [1136] FATAL: could not extend file "base/34163166/44033600.20": No space left on device
2019-06-21 17:41:08.770 UTC [1136] HINT: Check free disk space. 2019-06-21 17:41:08.770 UTC [1136] CONTEXT: xlog redo at 662/6A087C30 for Heap/INSERT+INIT: off 1 2019-06-21 17:41:09.188 UTC [13036] FATAL: could not write to file "pg_xlog/xlogtemp.13036": No space left on device

My disks size are all the same (db1 is bigger because the server has other files on it):

DB1: /dev/sda1 496G 266G 210G 56% /

DB2: /dev/sda1 496G 177G 299G 38% /

DB3 /dev/sda1 496G 177G 299G 38% /

size of the main table is the same on master and replicas

Size | 173 GB

Databases are all running on different VMs. The commands I ran are simple, SELECT count(*) FROM TABLE1 WHERE X, then DELETE FROM TABLE1 WHERE X, then VACUUM TABLE1. And this script is iterating by batch of 5 days on a timestamp column.
Hence my VACUUM command is split per table. I've been doing backups before starting the script (with a pg_backup that is then sending the back up files to another backup server)

I am actually not sure why the disk of db2 got filled :/ , would you have any idea ? I need to run my script so I think I might need to stop the replication to run it.

Thank you for your help, cheers

Best Answer

The standby server has to write the WAL that it gets from the primary server, it needs enough disk space for that.

Since DB2 is not only a standby, but also a primary server for DB3 and DB4, it might need to retain WAL for a while (I don't know how you configured that).

Just give all standby servers as much disk space as the primary has, then you should be good. Be generous with disk space!

It could be that extra disk space is used if you use replication slots and a standby cannot keep up, perhaps because of a slow network connection. That could be the cause of your out-of-disk condition.

Remember that DELETE and VACUUM do not free any disk space.