Postgresql – Too much data queued in replication slots

master-slave-replication, postgresql, postgresql-9.4, replication

Periodically we run a batch process that sequentially backs up a set of about 20,000 tables (dumping each to CSV, deleting the exported rows, vacuuming, and then moving the dump to remote storage in AWS S3) with the following shell script:

#!/bin/bash
DATA_DIR=/home/admin/backup_avl_historico/data
DB_HOST=localhost
DB_USER=waypoint

#extract table list
logger 'Extracting Table List'
psql -h $DB_HOST -U $DB_USER -c "select table_name from information_schema.tables where table_name like 'avl_historico_%';" -t -o $DATA_DIR/tables.list
array=($(wc -l $DATA_DIR/tables.list))
logger ''$array
total_tables=${array[0]}
logger 'Total tables: '$total_tables

#Get max date
max_date=$(psql -h $DB_HOST -U $DB_USER -t -c "select now() - interval '12 months'")
logger 'Max date: '$max_date

array=($max_date)
#date=${array[0]}
date=${array[0]}-${array[1]}; date=${date:0:16}
logger 'Only date: '$date

#Dump each table
while read table_name
do
logger 'looping...'
    if [ ! -z "$table_name" ]; then
        logger 'Processing table '$table_name
        slots=$(psql -U waypoint -h localhost -qt waypoint -c "SELECT coalesce(max(pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn)), 0) AS retained_bytes FROM pg_replication_slots" | tr -d ' ')
        while [ "$slots" -gt 50000000 ]
        do
            sleep 1
            echo Sleeping
            slots=$(psql -U waypoint -h localhost -qt waypoint -c "SELECT coalesce(max(pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn)), 0) AS retained_bytes FROM pg_replication_slots" | tr -d ' ')
        done
        output=${table_name}_pre_${date}.csv
        psql -h $DB_HOST -U $DB_USER -t -F , -c "COPY (select * from reports.$table_name where fecha < '$max_date') TO STDOUT WITH CSV" -o ${DATA_DIR}/$output
        if [ -f ${DATA_DIR}/$output ];then
            if test -s ${DATA_DIR}/$output
            then
                logger 'Deleting records'
                psql -h $DB_HOST -U $DB_USER -c "delete from reports.$table_name where fecha < '$max_date'"
                logger 'Gzipping '$output
                pigz ${DATA_DIR}/$output
                logger 'Moving to S3'
                aws s3 mv ${DATA_DIR}/$output.gz s3://backup-avl-historico/avl_historico/
                logger 'Vacuuming table'
                psql -h $DB_HOST -U $DB_USER -c "vacuum full analyze reports.$table_name"
            else
                rm ${DATA_DIR}/$output
            fi
        fi
    fi
done < $DATA_DIR/tables.list

We recently improved our replication setup to use replication slots, so the database can handle peaks or large queues of data without the master and slave de-synchronizing. After that change was introduced, running the backup script would fill the slave's drive, rendering the process unusable. We later added a check that waits whenever the replication queue exceeds 50 MB, but that didn't actually help: last night the process again used up all the space left on the replica node's data drive.

What could be wrong/misconfigured? What config and/or log files should I provide?

Best Answer

You are measuring how much WAL is retained on the primary because it has not yet been sent to the replica. But the drive that fills up is the replica's, not the master's, so this is not the right thing to measure.

So what you need to measure is the amount of WAL received by the replica but not yet replayed and eligible for removal. You can do this by connecting to the replica (if it is in hot-standby mode) and comparing pg_last_xlog_receive_location() with pg_last_xlog_replay_location() (these were renamed pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() in PostgreSQL 10). But even that is probably not quite right: the fact that WAL has been replayed doesn't mean it has been removed, since removal also has to wait for the next restartpoint. So it would probably be better to check the disk space directly, for example by using ssh to run a df command on the replica.
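As a rough sketch of both checks, assuming a hypothetical standby host `replica.example.com`, a hypothetical mount point for the replica's data drive, and GNU `df` (the `--output=pcent` flag), the replica-side gating could look like this (9.4 function names are used, matching the question):

```shell
#!/bin/bash
REPLICA_HOST=replica.example.com   # hypothetical; use your standby's address
DATA_MOUNT=/var/lib/postgresql     # assumed mount point of the replica's data drive

# WAL received by the standby but not yet replayed, in bytes
# (9.4 names; on v10+ use pg_last_wal_receive_lsn()/pg_last_wal_replay_lsn()).
replay_lag_bytes() {
    psql -h "$REPLICA_HOST" -U waypoint -qt -c \
      "SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                                    pg_last_xlog_replay_location())" | tr -d ' '
}

# True when the given lag (bytes) is at or below the limit (default 50 MB).
lag_ok() {
    local lag=$1 limit=${2:-50000000}
    [ "$lag" -le "$limit" ]
}

# Extract the bare percentage number from `df --output=pcent` output.
parse_pcent() {
    tail -n 1 | tr -dc '0-9'
}

# Percentage of the data drive used on the replica, e.g. "83".
replica_disk_used_pct() {
    ssh "$REPLICA_HOST" df --output=pcent "$DATA_MOUNT" | parse_pcent
}

# Pause the batch while the replica is behind or its drive is too full.
wait_for_replica() {
    while ! lag_ok "$(replay_lag_bytes)" || [ "$(replica_disk_used_pct)" -gt 90 ]; do
        echo "Replica lagging or low on disk, sleeping"
        sleep 60
    done
}
```

Calling `wait_for_replica` before each table's delete in the loop would gate on the replica's actual state rather than on WAL retained by the primary.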