Postgresql – When Master is being auto vacuumed, Is slave gettting exclusive lock on its table to apply WAL

lockingpostgresqlwrite-ahead-logging

I've met the very curious pattern that triggers a lot of wait queue on slaves.
And I think I finally find it's because of auto vacuum. However, I've never got such description on official document nor any of blogs of PostgreSQL.
So I would really appreciate if someone knows this question.

I have a big table(I'll say, A table) on Master, and I run periodically a batch script on slaves and this batch calls a lot of SELECT queries on the A table.

One day, I found that on slave there are a lot of wait queues, and there were waiting on the SharedAccessLock on the A table. However, these queries are also SELECT query and so these queries should never wait any SELECT queries.

So I've searched around and I found that when slaves are being got lots of SELECT queries on the A table, and at the same moment Master is executing autovacuum on the A table. I stopped autovacuum on Master and kill the autovacuum process, then restart all the queries to slaves. After that wait queues disappeared.

Error logs say, these SELECT queries on slaves wait SharedAccessLock, however I am guessing that these SELECT queries waits the WAL apply. I know it's very odd to say like this thing. My PostgreSQL is 9.6ver.

Best Answer

If autovacuum finds a bunch of empty pages at the end of the table, it will try to truncate those pages away to return the space to the OS. It will need to obtain an Access Exclusive lock to do this, and that lock is recorded in WAL and is replayed by and held by the replica server for the duration of the truncation (and replay of the truncation WAL). hot_standby_feedback will not directly prevent this lock from being taken, although it may delay the pages from becoming empty (as the tuples won't be removed if they are still needed on the standby) but sooner or later they will need to go.

If you are truncating away a large amount of empty pages, this can take a while (but that performance was improved substantially in v10). If the truncation on the master detects that it is blocking someone, it will stop the truncation before completion, leaving the rest of the empty pages to be truncated by some future vacuum. But it cannot detect that it is blocking queries on the replica, so it cannot abort the truncation. Instead of killing the autovac worker, you could have just opened a session on the master and taken a dummy lock on the table to force the autovac worker to yield gracefully. The dummy lock does not need to be held for a prolonged period, it can be dropped as soon as it is acquired, and it can be a weak lock, even just Access Share.

As far as I know, this is the only situation in which autovacuum will take an Access Exclusive lock.