Postgresql – pg_dump. Some COPY processes stuck on ‘idle in transaction’ after pg_dump completion

pg-dumppostgresql

We do a dump of huge database (~ 2 TB) via pg_dump command:

pg_dump -h 192.168.0.132 -d vodka_factory -U vodka_user -F d -j 10 -v -f /backups/vodka_factory_2021-03-26-23-56.dump/

pg_dump process successfully (return code 0, no errors), but on database we can see ~8 COPY processes and one "select" in 'idle in transaction' which never finish:
enter image description here

No locks. More detailed information for some processes:

postgres=# select * from pg_stat_activity where pid=358683;
-[ RECORD 1 ]----+---------------------------------------------------------------
datid            | 16404
pid              | 358683
application_name | pg_dump
client_addr      | 10.136.131.164
client_hostname  | 
client_port      | 57732
backend_start    | 2021-03-26 23:56:43.311535+03
xact_start       | 2021-03-26 23:56:43.321329+03
query_start      | 2021-03-26 23:57:07.682549+03
state_change     | 2021-03-26 23:57:07.683027+03
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      | 
backend_xmin     | 3802998627
query            | SELECT last_value, is_called FROM public.waterfall_task_id_seq
backend_type     | client backend

postgres=# select * from pg_stat_activity where pid=358919;
-[ RECORD 1
datid            | 16404
pid              | 358919
application_name | pg_dump
client_addr      | 10.136.131.164
client_hostname  | 
client_port      | 57744
backend_start    | 2021-03-26 23:57:07.836322+03
xact_start       | 2021-03-26 23:57:07.844714+03
query_start      | 2021-03-27 00:29:19.240355+03
state_change     | 2021-03-27 00:30:10.458445+03
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      | 
backend_xmin     | 3802998627
query            | COPY public.slr_child_care (id, benefit_type_id, child_count, child_index, start_month, region_index, ent_id, created, modified, external_id, begin, "end", candidate_id, hospital_json, calcs_json, years_replace, first_year_replace, second_year_replace, excluded_days, old_end, compensation, alg_code, state, correction_id, ref_number, date_of_issue, regional_factor, share_by) TO stdout;
backend_type     | client backend

Databse version: PostgreSQL 11.6 on x86_64-pc-linux-gnu
pg_dump version: pg_dump (PostgreSQL) 11.8

Could you please advice how we can investigate the issue? The iisue happens every time…

Best Answer

It is slightly surprising that pg_dump returned sucessfully with a return code of 0. The only theory I can come up with is that some of the 10 jobs got done early and started to idle, whereupon a misconfigured firewall cut the connection. The backends should die after around 2 hours due to TCP keepalive.

What you could do is reduce the keepalive idle time and see if that improves things:

pg_dump -d 'host=... port=... dbname=... user=... keepalives_idle=300' -f dumpfile