PostgreSQL – Performance Impact of Updating Target Columns with ON CONFLICT

performancepostgresqlpostgresql-performance

This is a question about the inner-workings of Postgres (v10) and performance.

Given a table, github_repos, with a multi-column unique index on org_id and github_id columns, is there any performance difference (or other issues to be aware of) between the two bulk upsert operations below? The difference is that in the first query, the org_id and github_id columns are included in the UPDATE, whereas in the second query they are not. Since the UPDATE runs ON CONFLICT, the updated values of org_id and github_id will be the same as the old values, but those columns are included in the UPDATE because the underlying library I am using is designed that way. I'm wondering if its safe to use as-is or whether I should be explicitly excluding those columns in the UPDATE.

Query #1:

INSERT INTO "github_repos" ("org_id","github_id","name")
VALUES (1,1,'foo')
ON CONFLICT (org_id, github_id)
DO UPDATE SET "org_id"=EXCLUDED."org_id","github_id"=EXCLUDED."github_id","name"=EXCLUDED."name"
RETURNING "id"

Query #2:

INSERT INTO "github_repos" ("org_id","github_id","name")
VALUES (1,1,'foo')
ON CONFLICT (org_id, github_id)
DO UPDATE SET "name"=EXCLUDED."name"
RETURNING "id"

github_repos table:

      Column       |       Type        | Collation | Nullable 
-------------------+-------------------+-----------+----------+
 id                | bigint            |           | not null |
 org_id            | bigint            |           | not null |
 github_id         | bigint            |           | not null |
 name              | character varying |           | not null |

Indexes:
    "github_repos_pkey" PRIMARY KEY, btree (id)
    "unique_repos" UNIQUE, btree (org_id, github_id)

Best Answer

It seems to have no difference from performance perspective , assuming you have no triggers that runs only if certain columns are updated.

The whole row is updated , or to be more Postgres specific (Postgres doesn't have in-place updates), the new tuple will be inserted, and the old one will be marked as dead . From that regard it doesn't matter if actual change happens for only one column, or all of them , or neither . You can check values of n_tup_upd, n_dead_tup columns in pg_stat_all_tables - they both increase after each update . n_dead_tup will eventually get reset after performing vacuum.

The only thing that might be negatively affected is HOT updates (no new index tuple if indexed column doesn't change value). HOT updates are explained in many places, the best one in my opinion is http://www.interdb.jp/pg/pgsql07.html .

Testing HOT updates (done with Postgres 9.6, 10.4 and 10.5)

--- setup
-- useful extension for checking data and index pages
create extension pageinspect; 
-- dummy table  (varchar is used instead of text just to avoid any complication related to TOAST )
create table test_update(test_update_id int not null , some_data varchar(50),constraint test_update_pkey primary key (test_update_id) );

-- insert  2 rows
insert into test_update (test_update_id, some_data) values (1, 'test1');
insert into test_update (test_update_id, some_data) values (2, 'test2');

--check table stats : 
select relname ,n_tup_upd, n_dead_tup , n_tup_hot_upd  from pg_stat_all_tables where relname ='test_update';
 --(all zeroes) 
-- check #rows in index :
select * from bt_page_items('test_update_pkey',1) ;
-- 2 rows

----------------------------------------------
-- Test 1 . perform dummy update : 
update  test_update set test_update_id = test_update_id , some_data = some_data;

-- check tables stats again, 
--n_tup_upd = n_dead_tup = n_tup_hot_upd  = 2 
-- check index page again, results are the same - 2 rows

-- Test 2 . real update non-key column , and dummy update of key column  :
update  test_update set test_update_id = test_update_id , some_data = some_data||'_new';

-- check tables stats again, 
--n_tup_upd = n_dead_tup = n_tup_hot_upd  = 4 
-- check index page again, results are the same - 2 rows

-- Test 3 . real update non-key column , key column is not in the statement:
update  test_update set some_data = some_data||'_new_2';

-- check tables stats again, 
--n_tup_upd = n_dead_tup = n_tup_hot_upd  = 6 
-- check index page again, results are the same - 2 rows

-- Test 4. Now do real update of key column :

update  test_update set test_update_id = test_update_id+2;
-- check tables stats again, 
--n_tup_upd = n_dead_tup = 8, n_tup_hot_upd  = 6  (no HOT update this time)
-- check index page again, now we see 4 rows

It appears that Postgres is smart enough to identify cases where indexed columns are not changed , and perform HOT updates; thus , there is no difference between having or not having key columns in update statement from performance point of view. The only thing that matters it whether actual value changed. Surely, this behaviour is limited to B-Tree indexes.