PostgreSQL – Split Table and Set Up References

bulk-insertjoin;postgresql

My title is a little misleading, what I am doing is inserting data from a CSV file where each row has information corresponding to three different tables. You can imagine the schema being something like this:

CREATE TABLE runner (id integer PK, name varchar);
CREATE TABLE race (id integer PK, name varchar);
CREATE TABLE race_result (id integer PK, race_id integer, runner_id ineteger, fp integer);

When I receive the data, it is in a CSV file where each row looks like:

runner_name, race_name, finish_position

Each runner can have multiple race results and so can each race. I am trying to figure out an efficient way to insert this data. I can first insert all runners, then insert all races, then finally insert all race_results and loop up all the ids to fill in the FKs, but this would be awfully slow. Another thing I am trying is:

CREATE TEMP TABLE tmp_x (run_name vc, race_name vc, fp int);
COPY tmp_x FROM 'input.csv' (FORMAT csv);

WITH run_ins AS (
  INSERT INTO runner (name)
    SELECT run_name
    FROM tmp_x
  ON CONFLICT ON some_contraint DO NOTHING
  RETURNING id
), rac_ins AS (
  INSERT INTO race (name)
    SELECT race_name
    FROM tmp_x
  ON CONFLICT ON some_constraint DO NOTHING
  RETURNING id)
INSERT INTO race_result (race_id, runner_id, fp)
  SELECT rc.id, ru.id, t.fp
  FROM rac_ins rc, run_ins ru, tmp_x t;

DROP TABLE tmp_x;

The problem is that this joins the tables creating more results then there actually are. First, is this the best way to go about it? If it is, can I remedy this by explicitly JOINing the tables in some way? I think my problem is that I do not fully understand what happens when I use the WITH statement. Does it do all of the inserts first then creates a temp table/view that has all of the ids of the records that were inserted? If so, does it return an id twice if I tried to insert but there was a conflict on a constraint?

Thank you

Best Answer

Let's assume this is your tables (I've changed naming for clarity and ease of use):

CREATE TABLE runner 
(
    runner_id serial PRIMARY KEY,
    runner_name varchar UNIQUE
);

CREATE TABLE race 
(
    race_id serial PRIMARY KEY,
    race_name varchar UNIQUE
) ;

CREATE TABLE race_result 
(
    race_id integer REFERENCES race(race_id), 
    runner_id integer REFERENCES runner(runner_id), 
    finish_position integer NOT NULL,
    UNIQUE (race_id,finish_position),
    PRIMARY KEY (runner_id, race_id)
);

CREATE /* TEMP */ TABLE tmp_x 
(
    runner_name varchar, 
    race_name varchar, 
    finish_position integer
);

And this is your data

-- COPY tmp_x FROM 'input.csv' (FORMAT csv);
INSERT INTO tmp_x
VALUES
  ('runner1', 'race1', 1),
  ('runner2', 'race1', 2),
  ('runner3', 'race1', 3),
  ('runner4', 'race1', 4),
  ('runner5', 'race1', 5),
  ('runner1', 'race2', 5),
  ('runner2', 'race2', 4),
  ('runner3', 'race2', 3),
  ('runner4', 'race2', 2),
  ('runner5', 'race2', 1),
  ('runner6', 'race3', 1),
  ('runner2', 'race3', 2),
  ('runner7', 'race3', 3),
  ('runner4', 'race3', 4),
  ('runner9', 'race3', 5) ;

I would insert the data in 3 steps. First, you INSERT in your runners table the ones you don't yet have. Then you do exactly the same for races.

-- Create new runners
INSERT INTO
   runner (runner_name)
SELECT DISTINCT
   runner_name
FROM
   tmp_x
WHERE
   NOT EXISTS (SELECT * 
                 FROM runner r0 
                WHERE r0.runner_name = tmp_x.runner_name) 
ORDER BY
    runner_name ;

-- Create new races
INSERT INTO
   race (race_name)
SELECT DISTINCT
   race_name
FROM
   tmp_x
WHERE
   NOT EXISTS (SELECT * 
                 FROM race r0 
                WHERE r0.race_name = tmp_x.race_name) 
ORDER BY
    race_name;

NOTE: Using "NOT EXISTS..." works with versions of PostgreSQL older than 9.6, and is (AFAIK) pure standard SQL; so, it should work on other DB as well.

And now that your master tables are filled in, you JOIN your temporary table to them, and add your finish positions. [I assume the combinations are new data, and we do not need to check for already existing data.]

INSERT INTO
    race_result (race_id, runner_id, finish_position)
SELECT
    race_id, runner_id, finish_position
FROM
    tmp_x
    JOIN runner USING(runner_name)
    JOIN race USING(race_name) 
ORDER BY
    race_id, runner_id, finish_position ;

At this point, you can query:

SELECT
  race_name, finish_position, runner_name
FROM
  race_result 
  JOIN runner USING(runner_id)
  JOIN race USING(race_id) 
ORDER BY
  race_name, finish_position, runner_name;

... and get

| race_name | finish_position | runner_name |
|-----------|-----------------|-------------|
|     race1 |               1 |     runner1 |
|     race1 |               2 |     runner2 |
|     race1 |               3 |     runner3 |
|     race1 |               4 |     runner4 |
|     race1 |               5 |     runner5 |
|     race2 |               1 |     runner5 |
|     race2 |               2 |     runner4 |
|     race2 |               3 |     runner3 |
|     race2 |               4 |     runner1 |
|     race2 |               5 |     runner2 |
|     race3 |               1 |     runner6 |
|     race3 |               2 |     runner2 |
|     race3 |               3 |     runner7 |
|     race3 |               4 |     runner4 |
|     race3 |               5 |     runner9 |

Check it all at SQLFiddle

NOTE: Your race_result table doesn't need a surrogate primary key. There's a natural key (race_id, runner_id) or (race_id, finish_position).

Related Solutions

Postgresql – psql: id SERIAL PRIMARY KEY not starting with 1

The table should be

create table cali (
    id serial primary key,
    alk_from bigint,
    alk_to bigint,
    and_from int,
    and_to int
);

And do

\COPY cali (alk_from, alk_to, and_from, and_to) 
FROM '/home/.../data/output/id_cali.csv' 
(FORMAT CSV);

So that it knows to not insert into the id serial column. The serial type is not a true type. From the manual:

The data types smallserial, serial and bigserial are not true types, but merely a notational convenience for creating unique identifier columns (similar to the AUTO_INCREMENT property supported by some other databases)

Thus, we have created an integer column and arranged for its default values to be assigned from a sequence generator

Postgresql – How to split table into two tables in one transaction or query

I cannot understand how to insert each row of one table into another table and update the same row of first table with returned id

You don't have to use an intermediate table. It can be done in one statement, with a writeable CTE:

WITH t as
 (INSERT INTO media(origin) SELECT DISTINCT avatar FROM profile RETURNING id,origin)  
 UPDATE profile SET avatar_id=t.id FROM t WHERE avatar=t.origin;

(assuming that media.id is automatically set by its DEFAULT clause).

Or the same as two statements in a transaction, using a different form of update:

BEGIN;
INSERT INTO media(origin) SELECT DISTINCT avatar FROM profile;
UPDATE profile SET avatar_id=(SELECT id FROM media WHERE profile.avatar=media.origin);
COMMIT;

If you know that profile.avatar is unique, DISTINCT can be omitted, otherwise it's required.

Best Answer

Related Solutions

Postgresql – psql: id SERIAL PRIMARY KEY not starting with 1

Postgresql – How to split table into two tables in one transaction or query

Related Question