PostgreSQL – Make ALTER TABLE Wait for Lock Without Blocking

alter-table, locking, postgresql

Many PostgreSQL ALTER TABLE commands, like adding a new column with a default value, have clever optimisations in recent versions of PostgreSQL that allow them to execute essentially instantly, even on large tables, once Postgres has briefly acquired a lock on the table.

Unfortunately, that final caveat is important. A command like this one from the linked blog post:

ALTER TABLE users ADD COLUMN credits bigint NOT NULL DEFAULT 0;

still needs to wait for an exclusive lock on the users table before it can run, even though it'll execute instantly once the lock is acquired. Worse, while it's waiting for that lock, it blocks all writes and reads involving the table.

Some simple steps to reproduce this (tested in Postgres 13.3):

  1. In one psql shell, create a table, then start a transaction, do a read from the table, and don't commit:

    CREATE TABLE users (id SERIAL, name TEXT);
    INSERT INTO users (name) VALUES ('bob'), ('fred');
    START TRANSACTION;
    SELECT * FROM users WHERE id = 1;
    
  2. Leave the first shell open, then open a second one and try to alter the table:

    ALTER TABLE users ADD COLUMN credits bigint NOT NULL DEFAULT 0;
    

    Observe that this query hangs, waiting for the transaction in the first shell to be committed.

  3. Open a third terminal, and try to run:

    SELECT * FROM users WHERE id = 2;
    

    Observe that this also hangs; it's now blocked waiting for the ALTER TABLE command to complete, which is in turn blocked waiting for the first transaction to complete.

It seems that most or all ALTER TABLE commands behave like this. Even if the operation itself is very fast or can run without holding a lock for the whole operation, ALTER TABLE still needs to briefly acquire an exclusive lock on the table before it can start its work, and while it's waiting for that lock, all other statements that touch the table – even reads! – are blocked.
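If you want to watch this lock queue form while the statements above are hanging, a query along these lines (using `pg_blocking_pids`, available since Postgres 9.6) run from a fourth session shows which backends are waiting and who is blocking them; the column selection here is just one reasonable choice:

```sql
-- Run from a separate session while the statements above are hanging.
-- Each waiting backend is listed along with the PIDs it is blocked by.
SELECT pid,
       state,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```

You should see the SELECT from step 3 blocked by the ALTER TABLE backend, which is in turn blocked by the idle-in-transaction backend from step 1.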

Needless to say, this behaviour is pretty problematic if you want to make changes to a table that is occasionally involved in long-running transactions. If the ALTER TABLE statement is blocked by a long-running transaction that happens to hold any kind of lock on the table at the moment the ALTER TABLE statement is run, all interactions with the table are blocked until that transaction finishes, and anything depending on the table probably experiences downtime.

Is there a canonical solution to this problem?

A crude solution that I've tried is to use a wrapper script that repeatedly attempts to run the ALTER TABLE statement via a connection with lock_timeout set to a small value (e.g. 5 seconds). If the ALTER TABLE fails by hitting the lock timeout, the transaction aborts and the script catches the error, waits a minute or two, and tries the whole process again. This avoids outright downtime, but still has performance implications, since every failed attempt to run the ALTER TABLE statement still blocks queries for a few seconds.
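The SQL each attempt of that wrapper runs looks roughly like this sketch (the 5-second value is just an example):

```sql
SET lock_timeout = '5s';
-- If the exclusive lock can't be acquired within 5 seconds, this fails with
-- "ERROR: canceling statement due to lock timeout"; the wrapper script
-- catches the error, sleeps for a while, and retries.
ALTER TABLE users ADD COLUMN credits bigint NOT NULL DEFAULT 0;
```

The downside described above is visible here: during each 5-second window, the pending ALTER TABLE sits ahead of other queries in the lock queue and blocks them.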

What I'd really like to do is somehow tell Postgres that I want the ALTER TABLE statement to wait for a moment when it can acquire the lock on the table without blocking other queries in the meantime. (I don't mind if that means it waits hours until it finally reaches a moment where no other queries are touching the table; if it avoids blocking other queries, that's absolutely an acceptable tradeoff.) Is there some way to do this – perhaps some incantation I can include in the ALTER TABLE statement, or some configuration parameter I can set to change this behaviour?

Best Answer

Unfortunately, there is no great alternative to just retrying in a loop. But you can perhaps make the retrying more clever. When I need to do this and can be in a transaction block, I take the lock explicitly, and use the NOWAIT option.
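Concretely, the transaction-block variant described here might look like the following sketch. The point of NOWAIT is that the LOCK statement errors out immediately if the lock isn't free, so a failed attempt never sits in the lock queue blocking other queries the way a waiting ALTER TABLE does:

```sql
BEGIN;
-- Fails immediately if any conflicting lock is held on the table;
-- nothing queues up behind this attempt.
LOCK TABLE users IN ACCESS EXCLUSIVE MODE NOWAIT;
ALTER TABLE users ADD COLUMN credits bigint NOT NULL DEFAULT 0;
COMMIT;
```

On failure, roll back and retry after a pause; on success, the ALTER TABLE runs with the lock already held.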

> but still has performance implications, since every failed attempt to run the ALTER TABLE statement still blocks queries for a few seconds.

You can set the timeout value much lower than a few seconds. Or you could use NOWAIT, which is roughly equivalent to setting lock_timeout to its lowest possible value, except that it applies only to that one lock acquisition and doesn't affect later statements (relevant in multi-statement transactions).

> What I'd really like to do is somehow tell Postgres that I want the ALTER TABLE statement to wait for a moment when it can acquire the lock on the table without blocking other queries in the meantime.

Yeah, having some better options here would be nice. It might be contentious to figure out exactly what that would look like, though. Maybe something like MySQL's low-priority lock, which keeps itself in the wait queue but lets other waiters jump ahead of it if they could immediately get the lock in the mode they want.