I'm running PostgreSQL 11.5. I have a simple table of registers.
create table register(
id serial primary key,
name text not null
);
This table is small (~6000 rows), relatively heavy read (100s queries/s), almost no write. I wanted to add a self-referential foreign key:
alter table register
add column leader_id integer references register(id);
Given the size of the table I assumed this would be a rather mundane change. It ran fine locally and on the staging server. However, when I ran it against production, the table locked up. Logs show ALTER and various SELECT statements taking ~10m to complete.
2020-08-04 00:01:15 UTC:10.0.2.101(59588):postgres@db:[21609]:LOG: could not receive data from client: Connection reset by peer
2020-08-04 00:02:39 UTC:10.0.2.101(59558):postgres@db:[1795]:LOG: could not receive data from client: Connection reset by peer
2020-08-04 00:02:39 UTC:10.0.2.101(59558):postgres@db:[1795]:LOG: unexpected EOF on client connection with an open transaction
2020-08-04 00:02:39 UTC:10.0.2.101(59578):postgres@db:[18376]:LOG: duration: 456741.453 ms execute <unnamed>: alter table register
add column leader_id integer references register(id);
2020-08-04 00:02:39 UTC:10.0.2.101(59578):postgres@db:[18376]:LOG: could not send data to client: Broken pipe
2020-08-04 00:02:39 UTC:10.0.2.101(59578):postgres@db:[18376]:FATAL: connection to client lost
2020-08-04 00:02:39 UTC:10.0.1.227(52906):db@db:[3365]:LOG: duration: 456635.839 ms statement: SELECT register.deleted_at IS NOT NULL AS deleted, register.client_id AS register_client_id
FROM register
WHERE register.id = 123 AND register.account_id = '22781BD1-F37A-4ACE-9A3D-CBF3464AFB43'::uuid
2020-08-04 00:02:39 UTC:10.0.1.227(52906):db@db:[3365]:LOG: could not send data to client: Connection timed out
2020-08-04 00:02:39 UTC:10.0.1.227(52906):db@db:[3365]:FATAL: connection to client lost
2020-08-04 00:02:39 UTC:10.0.1.227(52904):db@db:[3364]:LOG: duration: 456656.956 ms statement: SELECT register.deleted_at IS NOT NULL AS deleted, register.client_id AS register_client_id
FROM register
WHERE register.id = 234 AND register.account_id = 'A6D8395C-63E8-40A8-A0AE-4F19B1DA5509'::uuid
2020-08-04 00:02:39 UTC:10.0.1.227(52904):db@db:[3364]:LOG: could not send data to client: Connection timed out
2020-08-04 00:02:39 UTC:10.0.1.227(52904):db@db:[3364]:FATAL: connection to client lost
What's going on here that is causing the table to lock up? How can I safely work with a self-referential foreign key?
Best Answer
The quick thing to try is
set lock_timeout=1000;
, then try your ALTER a bunch of times. Most likely you'll hit the timeout each time, but you might get lucky and that'll save you from having to hunt down the actual fix, which is, I'm sorry to say, not nearly as easy.Almost certainly what is going on is that the other processes keep their transaction open after performing their SELECT on register. After that first SELECT is performed, the transaction will have an ACCESS SHARE lock on that table until the transaction commits or rolls back.
Unfortunately, ADD COLUMN requires an ACCESS EXCLUSIVE lock - regardless of whether there's a foreign key, self-referential or not, involved. And, per the table of conflicting lock modes, ACCESS EXCLUSIVE does conflict with ACCESS SHARE.
You can poke into which locks specifically are involved with
pg_locks
, and it's useful to joinpg_stat_activity
to see the actual queries. What follows is 3 psql sessions, interleaved, so you can see what's going on.As you can see, the ALTER TABLE is trying, and failing (I add an asterisk when pg_locks.granted is false) to get an AccessExclusiveLock on the register table.
At this point, your application is going to start having problems. Let's open up a 4th psql and try another SELECT:
PG, seeing that there's an already-waiting ACCESS EXCLUSIVE lock on the table, will no longer grant additional ACCESS SHARE locks on the table until the ACCESS EXCLUSIVE has finished it's thing and released the lock. So now your SELECTs are piling up and everything is stuck until all of the transactions that had selected from
register
as of when the ALTER TABLE first tried to grab the lock either commit or rollback.The fix here, alas, is "don't do that"; the application code needs to issue a COMMIT or ROLLBACK (as long as the SELECT is the only thing that has occurred in the transaction, it doesn't matter which) right after the SELECT, so that the ACCESS SHARE lock is released immediately. You could also split the ADD COLUMN and ADD FOREIGN KEY steps since only the ADD COLUMN requires ACCESS EXCLUSIVE (ADD FOREIGN KEY requires only SHARE ROW EXCLUSIVE, which does not conflict with ACCESS EXCLUSIVE), but I'd be surprised if that helps at all; it's the locking sequence that is causing issues, not the bit of extra work that is involved in doing the ADD FOREIGN KEY.