PostgreSQL – Split Table and Set Up References

bulk-insertjoin;postgresql

My title is a little misleading, what I am doing is inserting data from a CSV file where each row has information corresponding to three different tables. You can imagine the schema being something like this:

CREATE TABLE runner (id integer PK, name varchar);
CREATE TABLE race (id integer PK, name varchar);
CREATE TABLE race_result (id integer PK, race_id integer, runner_id ineteger, fp integer);

When I receive the data, it is in a CSV file where each row looks like:

runner_name, race_name, finish_position

Each runner can have multiple race results and so can each race. I am trying to figure out an efficient way to insert this data. I can first insert all runners, then insert all races, then finally insert all race_results and loop up all the ids to fill in the FKs, but this would be awfully slow. Another thing I am trying is:

CREATE TEMP TABLE tmp_x (run_name vc, race_name vc, fp int);
COPY tmp_x FROM 'input.csv' (FORMAT csv);

WITH run_ins AS (
  INSERT INTO runner (name)
    SELECT run_name
    FROM tmp_x
  ON CONFLICT ON some_contraint DO NOTHING
  RETURNING id
), rac_ins AS (
  INSERT INTO race (name)
    SELECT race_name
    FROM tmp_x
  ON CONFLICT ON some_constraint DO NOTHING
  RETURNING id)
INSERT INTO race_result (race_id, runner_id, fp)
  SELECT rc.id, ru.id, t.fp
  FROM rac_ins rc, run_ins ru, tmp_x t;

DROP TABLE tmp_x;

The problem is that this joins the tables creating more results then there actually are. First, is this the best way to go about it? If it is, can I remedy this by explicitly JOINing the tables in some way? I think my problem is that I do not fully understand what happens when I use the WITH statement. Does it do all of the inserts first then creates a temp table/view that has all of the ids of the records that were inserted? If so, does it return an id twice if I tried to insert but there was a conflict on a constraint?

Thank you

Best Answer

Let's assume this is your tables (I've changed naming for clarity and ease of use):

CREATE TABLE runner 
(
    runner_id serial PRIMARY KEY,
    runner_name varchar UNIQUE
);

CREATE TABLE race 
(
    race_id serial PRIMARY KEY,
    race_name varchar UNIQUE
) ;

CREATE TABLE race_result 
(
    race_id integer REFERENCES race(race_id), 
    runner_id integer REFERENCES runner(runner_id), 
    finish_position integer NOT NULL,
    UNIQUE (race_id,finish_position),
    PRIMARY KEY (runner_id, race_id)
);

CREATE /* TEMP */ TABLE tmp_x 
(
    runner_name varchar, 
    race_name varchar, 
    finish_position integer
);

And this is your data

-- COPY tmp_x FROM 'input.csv' (FORMAT csv);
INSERT INTO tmp_x
VALUES
  ('runner1', 'race1', 1),
  ('runner2', 'race1', 2),
  ('runner3', 'race1', 3),
  ('runner4', 'race1', 4),
  ('runner5', 'race1', 5),
  ('runner1', 'race2', 5),
  ('runner2', 'race2', 4),
  ('runner3', 'race2', 3),
  ('runner4', 'race2', 2),
  ('runner5', 'race2', 1),
  ('runner6', 'race3', 1),
  ('runner2', 'race3', 2),
  ('runner7', 'race3', 3),
  ('runner4', 'race3', 4),
  ('runner9', 'race3', 5) ;

I would insert the data in 3 steps. First, you INSERT in your runners table the ones you don't yet have. Then you do exactly the same for races.

-- Create new runners
INSERT INTO
   runner (runner_name)
SELECT DISTINCT
   runner_name
FROM
   tmp_x
WHERE
   NOT EXISTS (SELECT * 
                 FROM runner r0 
                WHERE r0.runner_name = tmp_x.runner_name) 
ORDER BY
    runner_name ;

-- Create new races
INSERT INTO
   race (race_name)
SELECT DISTINCT
   race_name
FROM
   tmp_x
WHERE
   NOT EXISTS (SELECT * 
                 FROM race r0 
                WHERE r0.race_name = tmp_x.race_name) 
ORDER BY
    race_name;

NOTE: Using "NOT EXISTS..." works with versions of PostgreSQL older than 9.6, and is (AFAIK) pure standard SQL; so, it should work on other DB as well.

And now that your master tables are filled in, you JOIN your temporary table to them, and add your finish positions. [I assume the combinations are new data, and we do not need to check for already existing data.]

INSERT INTO
    race_result (race_id, runner_id, finish_position)
SELECT
    race_id, runner_id, finish_position
FROM
    tmp_x
    JOIN runner USING(runner_name)
    JOIN race USING(race_name) 
ORDER BY
    race_id, runner_id, finish_position ;

At this point, you can query:

SELECT
  race_name, finish_position, runner_name
FROM
  race_result 
  JOIN runner USING(runner_id)
  JOIN race USING(race_id) 
ORDER BY
  race_name, finish_position, runner_name;

... and get

| race_name | finish_position | runner_name |
|-----------|-----------------|-------------|
|     race1 |               1 |     runner1 |
|     race1 |               2 |     runner2 |
|     race1 |               3 |     runner3 |
|     race1 |               4 |     runner4 |
|     race1 |               5 |     runner5 |
|     race2 |               1 |     runner5 |
|     race2 |               2 |     runner4 |
|     race2 |               3 |     runner3 |
|     race2 |               4 |     runner1 |
|     race2 |               5 |     runner2 |
|     race3 |               1 |     runner6 |
|     race3 |               2 |     runner2 |
|     race3 |               3 |     runner7 |
|     race3 |               4 |     runner4 |
|     race3 |               5 |     runner9 |

Check it all at SQLFiddle


NOTE: Your race_result table doesn't need a surrogate primary key. There's a natural key (race_id, runner_id) or (race_id, finish_position).