Apparently, you want to see constraint exclusion doing its work. Since in such setups the parent table has no constraint (and, just to mention it, usually no rows), it is always visited by queries on the parent table. Using your example schema, see the output:
test=# EXPLAIN SELECT count(1) FROM p;
QUERY PLAN
───────────────────────────────────────────────────────────────────
Aggregate  (cost=576.00..576.01 rows=1 width=0)
  ->  Append  (cost=0.00..501.00 rows=30001 width=0)
        ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=0)
        ->  Seq Scan on a  (cost=0.00..164.00 rows=10000 width=0)
        ->  Seq Scan on b  (cost=0.00..164.00 rows=10000 width=0)
        ->  Seq Scan on c  (cost=0.00..173.00 rows=10000 width=0)
test=# EXPLAIN SELECT count(1) FROM p WHERE type_id = 1;
QUERY PLAN
───────────────────────────────────────────────────────────────────
Aggregate  (cost=412.00..412.01 rows=1 width=0)
  ->  Append  (cost=0.00..387.00 rows=10002 width=0)
        ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=0)
              Filter: (type_id = 1)
        ->  Seq Scan on a  (cost=0.00..189.00 rows=10000 width=0)
              Filter: (type_id = 1)
        ->  Seq Scan on c  (cost=0.00..198.00 rows=1 width=0)
              Filter: (type_id = 1)
In the second case, table b is not visited, because the planner knows for sure that it cannot contain any rows matching the condition. At the same time, p is present in both cases. The same applies to c, too: the planner has no easy way (a CHECK constraint) to exclude either table from the plan, so it has to visit every table that could possibly contain matching rows.
Let's add a check constraint to c, too:
ALTER TABLE c ADD CHECK (type_id = 3);
test=# EXPLAIN SELECT count(1) FROM p WHERE type_id = 1;
QUERY PLAN
───────────────────────────────────────────────────────────────────
Aggregate  (cost=214.00..214.01 rows=1 width=0)
  ->  Append  (cost=0.00..189.00 rows=10001 width=0)
        ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=0)
              Filter: (type_id = 1)
        ->  Seq Scan on a  (cost=0.00..189.00 rows=10000 width=0)
              Filter: (type_id = 1)
Much better now.
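For reference, a minimal schema that reproduces these plans might look like this (the table and column names follow the example above; the exact column list and the way the CHECK constraints on a and b were created are assumptions):

```sql
CREATE TABLE p (
    id      integer,
    type_id integer
);

-- children "partitioned" by type_id via inheritance
CREATE TABLE a (CHECK (type_id = 1)) INHERITS (p);
CREATE TABLE b (CHECK (type_id = 2)) INHERITS (p);
CREATE TABLE c () INHERITS (p);  -- no CHECK yet, as in the first plans
```

With this setup, queries against p expand to an Append over p and all children, and constraint exclusion can prune only the children whose CHECK contradicts the WHERE clause.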
Now let's have a look at the locks that are taken.
BEGIN; -- to be able to see the locks
SELECT count(1) FROM p WHERE type_id = 1;
SELECT relname, locktype, mode
FROM pg_locks
JOIN pg_class c ON relation = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE nspname = 'dba'
AND relkind = 'r';
relname │ locktype │ mode
─────────┼──────────┼─────────────────
c │ relation │ AccessShareLock
b │ relation │ AccessShareLock
a │ relation │ AccessShareLock
p │ relation │ AccessShareLock
This is not so nice... Apparently (and somewhat surprisingly), going to the parent table locks all children, too.
But:
ROLLBACK;
BEGIN; -- to be able to see the locks
SELECT count(1) FROM a WHERE type_id = 1; -- touching only "a"
-- the locks taken:
a │ relation │ AccessShareLock
This means that if you can constrain your query to touch only a subset of the tables, you save the locks on the other children. A bit of experimentation shows that this applies to other statements (for example, an UPDATE), too; only the lock modes differ.
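A quick sketch of that experiment (the SET clause is just a no-op example; the dba schema is the same as in the lock query above):

```sql
BEGIN;
UPDATE a SET type_id = type_id WHERE type_id = 1;  -- touches only "a"

SELECT relname, locktype, mode
FROM pg_locks
JOIN pg_class c ON relation = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE nspname = 'dba'
  AND relkind = 'r';

-- relname │ locktype │ mode
-- a       │ relation │ RowExclusiveLock

ROLLBACK;
```

An UPDATE takes RowExclusiveLock on the tables it modifies instead of the AccessShareLock a plain SELECT takes, but the pruning behaviour is the same.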
In short: no.
One of the statistics PostgreSQL collects is the correlation between the values of a column (actually the ranks of those values) and the physical location of the rows in the table. You can find that result in pg_stats.correlation for your specific tablename and attname.
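Looking it up is a simple query against the pg_stats view (the schema, table, and column names here are placeholders):

```sql
SELECT tablename, attname, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'your_table'
  AND attname    = 'your_column';
```

A value near 1 or -1 means the column's values are stored in nearly sorted order on disk; a value near 0 means they are scattered randomly.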
However, the particular query you give is not one of the situations where this information is used. Its primary use is to estimate how much of the I/O done for the table-lookup step of a large index range scan (e.g. WHERE state = 'CA', or WHERE age BETWEEN 50 AND 65) will be sequential-like I/O (either truly sequential, close enough to sequential that read-ahead still triggers, or falling into the same table block as the previous row and so requiring no I/O at all) rather than random I/O.
Also, since your relationship is not monotonic but rather shaped like a parabola, the estimated correlation will understate the strength of the association in your case, as it captures only linear correlation.
Best Answer
For the discussion, let's assume your execution plan is a sequential scan with a filter. I also assume that you are using the default READ COMMITTED isolation level. Then PostgreSQL will read through the table sequentially.
Whenever it finds a row that matches the filter, that row will be locked and updated.
If locking a row is blocked by a concurrent query, PostgreSQL waits until the lock goes away. Then it re-evaluates the filter condition and either moves on (if the condition no longer applies on account of a concurrent modification) or it locks and updates the modified row.
See the documentation:
In particular, it is possible that two UPDATE statements that each modify several rows deadlock with each other, since they acquire locks as they proceed and locks are always held until the end of the transaction.
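A minimal sketch of how two such UPDATEs can deadlock (the table, column, and id values are hypothetical):

```sql
-- Session 1                           -- Session 2
BEGIN;                                 BEGIN;
UPDATE t SET x = 0 WHERE id = 1;
                                       UPDATE t SET x = 0 WHERE id = 2;
UPDATE t SET x = 0 WHERE id = 2;       -- session 1 now blocks on session 2's row lock
                                       UPDATE t SET x = 0 WHERE id = 1;
                                       -- session 2 blocks on session 1: deadlock
```

PostgreSQL's deadlock detector will abort one of the two transactions with a deadlock error after deadlock_timeout; the other then proceeds normally.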