Postgresql – Postgres : Using WITH … AS insert into to return non-inserted column

postgresqlstored-procedures

I'm creating an import job which uploads to a temporary table and then splits the data into two separate tables, joined by a key.

What I'm doing is more complicated than the following but this simple example demonstrates what I want to do.

So let's say I have a temporary table temp_import defined by :

create temp_import(
    row_id int,
    name varchar(32),
    email varchar(255)
)

I have two target tables defined by :

create users (
    userid int not null default nextval('tmp_uid_seq'::regclass),
    username varchar(32)
)

create emails (
    userid int,
    email varchar(255)
)

Currently I have some plpython which inserts a row at a time into users returning userid, followed by a second insert which uses the returned userid to populate emails.

loop:
insert into users (username) values 'Blah' returning userid
insert into emails (userid, email) values (userid, 'Blah2') 
goto loop

This works OK but I want to extract better performance by doing set based operations.

I want to create a temporary mapping table :

create table mapping (
    row_id int,
    userid int
)

and try to do something like :

with rows as (insert into users (username) 
select username from temp_import returning userid, row_id)
insert into mapping (userid, row_id)

The issue here is that I cannot return row_id as I am not selecting it for insert into the target table

If I could do this I could populate the emails table in one go:

insert into emails (userid,email)
select userid,email from temp_import t, mapping m
where t.row_id = m.row_id

Anyone know how I can return the row_id in the with rows as statement above?

Thanks in advance.

Best Answer

One possible solution is adding userid to the staging table you already have:

CREATE TABLE temp_import (
    row_id int,
    name varchar(32),
    email varchar(255),
    userid int not null default nextval('tmp_uid_seq'::regclass)
);

Use the same column definition above as you do in users - this way you use the same sequence, and even the nextval() will be correct after the import.

When doing the COPY, the userid will be filled. This way you have the mapping and the staging all in the same place. Now you have to copy the userid into the final table, too:

INSERT INTO users (userid, name)
SELECT userid, name
  FROM temp_import;

INSERT INTO mapping (userid, row_id)
SELECT userid, row_id
  FROM temp_import;

Notes:

  • in PostgreSQL, unlimited text and varchar have some performance advantage over the limited types. Use the latter if you really have a limit. With varchar(255) the case is usually 'We don't know how long it can be, but definitely not that long'. Email addresses are usually considered as limited to 254 characters, not 255: https://www.rfc-editor.org/errata_search.php?eid=1690 Also, are you sure that nobody can have a name longer than 32 characters, including spaces, punctuation and so on?
  • it usually makes sense to use some consistent naming scheme in any software system, including databases. I'm already confused a bit about the different styles used in userid and row_id.
  • if the import will happen repeatedly or in parallel, just use the same sequence all the time. Sequences are transaction-safe and won't reuse any values until they wrap over (which can be mostly prevented by using bigint/bigserial).