PostgreSQL – the fastest way to duplicate inserts of an identity column to another table

postgresql

I have two tables: a data table with an auto-incrementing id column, and an ids table. I would like every id in the data table to show up in the ids table on insert. I have many such data tables and I want to ensure that the ids in each cannot clash with one another; I also want to foreign key to the global list of ids in the ids table. I would like this to happen automatically so I don't need to script explicit inserts into the ids table. I know this is straightforward to achieve using triggers, but I want to know the most performant way to do it for data sizes beyond 100m rows. Simplified example code follows:

The tables:

create table data(
  id bigint generated always as identity(start 1 minvalue 1 maxvalue 100000000 no cycle cache 1 increment 1),
  name text
  );

create table ids(id bigint);

I tried using a rule, but had an issue with the volatile nature of the identity column:

create rule url_id_sync as on insert to data do also insert into ids(id) values (new.id);

insert into data(name) values ('some value');

The data:

select * from data;

|id|name      |
--------------
|1 |some value|

select * from ids;
|id|
----
|2 |

…which is not correct of course.
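
As far as I can tell, the rule rewrites the statement, so new.id expands to the identity column's nextval() default and is evaluated a second time for the insert into ids. Checking the backing sequence confirms it advanced twice for the single row; the sequence name below assumes the default data_id_seq naming:

select last_value from data_id_seq;  -- shows 2 after the single insert above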

So, I tried a trigger approach:

create or replace function ufnt_internal_identifier_insert() returns trigger as
$$
begin
  -- copy the freshly generated id into the global ids table
  insert into ids(id)
  values (NEW.id);
  return new;
end;
$$ language plpgsql
;

create trigger utr_ids_insert after insert on data for each row execute function ufnt_internal_identifier_insert();

This works fine of course:

insert into data(name) values ('some value');

select * from data;
|id|name      |
--------------
|1 |some value|

select * from ids;
|id|
----
|1 |

Some perf metrics: inserting 10m rows into the data table without the trigger takes ~24s on my system; with the trigger enabled it takes ~52s. I assume this is because it is doing twice the number of inserts.

I tried creating a 'sync' statement:

insert into ids(id)
select d.id from data d
left join ids i on d.id = i.id
where i.id is null;

Total time ~22s (~13s for the initial insert, ~9s for the 'sync').

My issues are as follows:

Over time both tables are expected to become large (>1bn rows), and the left join in the sync statement is going to spill to disk. It doesn't restrict the scan to the records that were just inserted into the data table; it checks all of them every time. The operation is batch-based, however.
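
A variation I have sketched, but not benchmarked at scale, restricts the scan to ids above the highest one already copied. It assumes ids keeps the primary key mentioned at the end, and that rows only ever arrive with increasing ids:

insert into ids(id)
select d.id
from data d
where d.id > coalesce((select max(i.id) from ids i), 0);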

The trigger approach fires on every row insert; I'm not sure how to make the trigger work at the statement level, or whether that would even be faster.
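
If I understand the documentation, a statement-level version would use a transition table along the lines of the sketch below (PostgreSQL 10 or later; the _stmt names are placeholders, and the row-level trigger would have to be dropped first to avoid double inserts), but I haven't verified that it is actually faster:

create or replace function ufnt_internal_identifier_insert_stmt() returns trigger as
$$
begin
  -- inserted_rows is the transition table holding every row inserted by the statement
  insert into ids(id)
  select id from inserted_rows;
  return null;  -- the return value is ignored for AFTER ... FOR EACH STATEMENT triggers
end;
$$ language plpgsql;

create trigger utr_ids_insert_stmt
  after insert on data
  referencing new table as inserted_rows
  for each statement
  execute function ufnt_internal_identifier_insert_stmt();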

Are there any other techniques I can follow up? Would partitions help? Indexes?

Of course, in the full production system the id columns will be primary keys. I left this out of the example code, but I've seen a performance improvement from dropping the constraints before the inserts and applying them again afterwards.
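
Roughly what I mean by that (the ids_pkey constraint name is the default PostgreSQL would generate, and is an assumption here):

alter table ids drop constraint ids_pkey;
-- ... run the bulk insert / sync ...
alter table ids add constraint ids_pkey primary key (id);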

Best Answer

Try a data-modifying CTE: both inserts happen in one statement, and the ids copied into ids come straight from RETURNING, so they are exactly the values the identity column generated:

WITH cte AS ( INSERT INTO data(name) 
              VALUES ('some value'), ('one more value')
              RETURNING id )
INSERT INTO ids 
SELECT id
FROM cte;

fiddle
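
The same data-modifying CTE pattern covers batch loads as well, for example when pulling rows from another table (staging is a hypothetical name here):

WITH cte AS ( INSERT INTO data(name)
              SELECT s.name
              FROM staging s
              RETURNING id )
INSERT INTO ids
SELECT id
FROM cte;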