PostgreSQL – the fastest way to duplicate inserts of an identity column to another table

postgresql

I have two tables: a data table with an auto-incrementing id column, and an ids table. I would like every id in the data table to show up in the ids table on insert. I have many such data tables and I want to ensure that the ids in each cannot clash with one another; I also want to foreign key to the global list of ids in the ids table. I would like this to happen automatically so I don't need to script explicit inserts into the ids table. I know this is straightforward to achieve using triggers, but I want to know the most performant way to do it for data sizes beyond 100m rows. Simplified example code follows:

The tables:

create table data(
  id bigint generated always as identity(start 1 minvalue 1 maxvalue 100000000 no cycle cache 1 increment 1),
  name text
  );

create table ids(id bigint);

I tried using a rule, but had an issue with the volatile nature of the identity column:

create rule url_id_sync as on insert to data do also insert into ids(id) values (new.id);

insert into data(name) values ('some value');

The data:

select * from data;

|id|name      |
--------------
|1 |some value|

select * from ids;
|id|
----
|2 |

…which is not correct of course.
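
As far as I can tell, the rule rewrites the statement, so new.id expands to the identity column's nextval() default and is evaluated a second time for the insert into ids. Checking the backing sequence confirms it advanced twice for the single row; the sequence name below assumes the default data_id_seq naming:

select last_value from data_id_seq;  -- shows 2 after the single insert above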

So, I tried a trigger approach:

create or replace function ufnt_internal_identifier_insert() returns trigger as
$$
begin
  -- copy the freshly generated id into the global ids table
  insert into ids(id)
  values (NEW.id);
  return new;
end;
$$ language plpgsql
;

create trigger utr_ids_insert after insert on data for each row execute function ufnt_internal_identifier_insert();

This works fine of course:

insert into data(name) values ('some value');

select * from data;
|id|name      |
--------------
|1 |some value|

select * from ids;
|id|
----
|1 |

Some perf metrics: inserting 10m rows into the data table without the trigger takes ~24s on my system; with the trigger enabled it takes ~52s. I assume this is because it is doing twice the number of inserts.

I tried creating a 'sync' statement:

insert into ids(id)
select d.id from data d
left join ids i on d.id = i.id
where i.id is null;

Total time ~22s (~13s for the initial insert, ~9s for the 'sync').

My issues are as follows:

Over time both tables are expected to become large (>1bn rows), and the left join in the sync statement is going to spill to disk. It doesn't restrict the scan to the records that were just inserted into the data table; it checks all of them every time. The operation is batch-based, however.
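
A variation I have sketched, but not benchmarked at scale, restricts the scan to ids above the highest one already copied. It assumes ids keeps the primary key mentioned at the end, and that rows only ever arrive with increasing ids:

insert into ids(id)
select d.id
from data d
where d.id > coalesce((select max(i.id) from ids i), 0);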

The trigger approach fires on every row insert; I'm not sure how to make the trigger work at the statement level, or whether that would even be faster.
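
If I understand the documentation, a statement-level version would use a transition table along the lines of the sketch below (PostgreSQL 10 or later; the _stmt names are placeholders, and the row-level trigger would have to be dropped first to avoid double inserts), but I haven't verified that it is actually faster:

create or replace function ufnt_internal_identifier_insert_stmt() returns trigger as
$$
begin
  -- inserted_rows is the transition table holding every row inserted by the statement
  insert into ids(id)
  select id from inserted_rows;
  return null;  -- the return value is ignored for AFTER ... FOR EACH STATEMENT triggers
end;
$$ language plpgsql;

create trigger utr_ids_insert_stmt
  after insert on data
  referencing new table as inserted_rows
  for each statement
  execute function ufnt_internal_identifier_insert_stmt();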

Are there any other techniques I can follow up? Would partitions help? Indexes?

Of course, in the full production system the id columns will be primary keys. I left this out of the example code, but I've seen a performance improvement from dropping the constraints before the inserts and applying them again afterwards.
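
Roughly what I mean by that (the ids_pkey constraint name is the default PostgreSQL would generate, and is an assumption here):

alter table ids drop constraint ids_pkey;
-- ... run the bulk insert / sync ...
alter table ids add constraint ids_pkey primary key (id);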

Best Answer

Try a data-modifying CTE: both inserts happen in one statement, and the ids copied into ids come straight from RETURNING, so they are exactly the values the identity column generated:

WITH cte AS ( INSERT INTO data(name) 
              VALUES ('some value'), ('one more value')
              RETURNING id )
INSERT INTO ids 
SELECT id
FROM cte;

fiddle
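
The same data-modifying CTE pattern covers batch loads as well, for example when pulling rows from another table (staging is a hypothetical name here):

WITH cte AS ( INSERT INTO data(name)
              SELECT s.name
              FROM staging s
              RETURNING id )
INSERT INTO ids
SELECT id
FROM cte;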