PostgreSQL INSERT Stuck – Causes and Solutions

insert, locking, postgresql

I ran into a weird issue with our analytical pipeline, and I'm not sure exactly what details I can provide for this situation.

Essentially, I got a 504 gateway timeout alert around 3 am during one of our analytical pipeline runs. These processes are read-heavy and only fairly light on writes.

I then noticed low CPU utilization on both our analytical processes and the PostgreSQL backends.

I wanted to see which queries were running and their start times, so I ran:

SELECT pid, xact_start, query_start, query FROM pg_stat_activity;

The queries here were not the usual analytical queries I expected, but I did notice that one PID had an xact_start at 3 am and was trying to insert a log entry into our log table.

This made me think: okay, is there a lock? So I ran a query to find out whether any PIDs were blocking other PIDs, and it came up empty.
(Query used from https://wiki.postgresql.org/wiki/Lock_dependency_information.)
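
(A simpler cross-check is pg_blocking_pids(), available since PostgreSQL 9.6, so it works on 12.2. Something along these lines lists any session that is waiting on another backend:)

-- List sessions that are blocked by another backend, with the blocking PIDs.
-- Returns no rows if nothing is waiting on a lock held by another session.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;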

So this made me realize that this stuck INSERT statement was what caused the 504 gateway timeout; it simply would not complete. The log table is append-only.

I checked the PostgreSQL logs around that time:

2020-09-30 03:02:30.864 EDT [1646] LOG:  checkpoint starting: time
2020-09-30 03:04:22.905 EDT [3153] relic@relic ERROR:  duplicate key value violates unique constraint "ms1_peak_target_ms1_peak_id_key"
2020-09-30 03:04:22.905 EDT [3153] relic@relic DETAIL:  Key (ms1_peak_id)=(22157648) already exists.
2020-09-30 03:04:22.905 EDT [3153] relic@relic STATEMENT:  UPDATE api.ms1_peak_target SET ms1_peak_id=22157648, comments='ms1_peak_id set to alternate. Original peak: 22153965' WHERE api.ms1_peak_target.ms1_peak_target_id = 4648
2020-09-30 03:04:35.085 EDT [1646] LOG:  checkpoint complete: wrote 1127 buffers (0.0%); 0 WAL file(s) added, 0 removed, 11 recycled; write=124.172 s, sync=0.020 s, total=124.221 s; sync files=115, longest=0.003 s, average=0.000 s; distance=181754 kB, estimate=251591 kB

Conflicting updates are fine in this specific workflow, but they happen before we write to the log table where this backend was stuck.

Once I killed that stuck insert backend (it had been running since 3 am; it was 10:30 am by then), the database came back to life: all the other processes started using much more CPU for the analytical workloads and everything continued.
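
(For reference, killing a backend like this from SQL is done with pg_terminate_backend(), which rolls back whatever transaction the session had open. A sketch that targets long-running idle-in-transaction sessions rather than a hard-coded PID:)

-- Terminate backends that have had a transaction open for more than an hour
-- and are sitting idle inside it; adjust the interval and filters as needed.
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '1 hour';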

System:

PostgreSQL: 12.2
CPU: 24 cores
RAM: 148 GB

Settings:

shared_buffers = 32GB
effective_cache_size = 96GB
wal_buffers = 16MB
checkpoint_timeout = 30min
max_wal_size = 20GB
min_wal_size = 1GB
max_worker_processes = 20
max_parallel_workers_per_gather = 2
max_parallel_workers = 20

I wish I had more details or other logs/errors from that time to share. What could cause an INSERT statement to stall and cause other transactions/backends to slow to a crawl?

Edit

Adding details as suggested, showing the state and wait event for the insert query in question.
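
(The output below came from something like the following query; the exact filter I used may have differed:)

SELECT pid, state, wait_event, xact_start, query_start, query
FROM pg_stat_activity
WHERE pid = 7828;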

  pid  |        state        | wait_event |          xact_start           |          query_start          | query
7828 | idle in transaction | ClientRead | 2020-09-30 18:24:09.234009-04 | 2020-09-30 18:24:09.411903-04 | INSERT INTO api.log

And here is pg_locks looking at pid 7828 from above
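
(Pulled with something along the lines of:)

SELECT * FROM pg_locks WHERE pid = 7828;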

   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |       mode       | granted | fastpath
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+------------------+---------+----------
 relation      |    16597 |    36906 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowExclusiveLock | t       | t
 relation      |    16597 | 22678058 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowExclusiveLock | t       | t
 relation      |    16597 |    36700 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 |    36693 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 |    36691 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 |    36682 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 26512029 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 26512023 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 22678077 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 22678076 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 22678069 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 22678060 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 22678060 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowExclusiveLock | t       | t
 relation      |    16597 | 26512072 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 26512070 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 relation      |    16597 | 26512062 |      |       |            |               |         |       |          | 17/1603484         | 7828 | AccessShareLock  | t       | t
 virtualxid    |          |          |      |       | 17/1603484 |               |         |       |          | 17/1603484         | 7828 | ExclusiveLock    | t       | t
 relation      |    16597 |    36904 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowExclusiveLock | t       | f
 transactionid |          |          |      |       |            |      87192816 |         |       |          | 17/1603484         | 7828 | ExclusiveLock    | t       | f
 relation      |    16597 |    36693 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowShareLock     | t       | f
 relation      |    16597 |    36700 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowShareLock     | t       | f
 relation      |    16597 |    36682 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowShareLock     | t       | f
 relation      |    16597 |    36691 |      |       |            |               |         |       |          | 17/1603484         | 7828 | RowShareLock     | t       | f

Another interesting thing is that I noticed MultiXactOffsetControlLock as a wait_event in pg_stat_activity:

7830 | active              | MultiXactOffsetControlLock | 2020-09-30 22:33:28.423817-04 | 2020-09-30 22:33:29.006849-04 | UPDATE auth."user"

I'm not sure whether this is related or not.
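
(A quick way to gauge how widespread these waits are is to aggregate pg_stat_activity by wait event; a rough sketch:)

-- Count non-idle sessions per wait event to see whether MultiXact waits dominate.
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state <> 'idle'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;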

Edit 2
The content is too large to paste inline, so I've provided Pastebin links.

Entire pg_locks
https://pastebin.com/qeV7W3ww

Entire pg_stat_activity
https://pastebin.com/9MV5u03q

Best Answer

Nobody is blocked by a lock.

The insert finished long ago, but the application forgot to commit the transaction. That is an application error, and you should start investigating there.

Processing must be blocked somewhere in the client application, because it sure isn't blocked in the database.

I see some MultiXactOffsetControlLock wait events in your pg_stat_activity, which indicate a performance bottleneck with concurrent data modifications on tables with foreign keys, but no blocking locks.
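
As a stopgap while the application is being fixed, PostgreSQL can abort such sessions itself via idle_in_transaction_session_timeout; a minimal sketch (the 15-minute limit is only an example, pick something longer than your legitimate transactions):

-- Abort any session that stays "idle in transaction" longer than 15 minutes.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '15min';
SELECT pg_reload_conf();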