My title is a little misleading, what I am doing is inserting data from a CSV file where each row has information corresponding to three different tables. You can imagine the schema being something like this:
CREATE TABLE runner (id integer PK, name varchar);
CREATE TABLE race (id integer PK, name varchar);
CREATE TABLE race_result (id integer PK, race_id integer, runner_id ineteger, fp integer);
When I receive the data, it is in a CSV file where each row looks like:
runner_name, race_name, finish_position
Each runner can have multiple race results and so can each race. I am trying to figure out an efficient way to insert this data. I can first insert all runners, then insert all races, then finally insert all race_results and loop up all the ids to fill in the FKs, but this would be awfully slow. Another thing I am trying is:
CREATE TEMP TABLE tmp_x (run_name vc, race_name vc, fp int);
COPY tmp_x FROM 'input.csv' (FORMAT csv);
WITH run_ins AS (
INSERT INTO runner (name)
SELECT run_name
FROM tmp_x
ON CONFLICT ON some_contraint DO NOTHING
RETURNING id
), rac_ins AS (
INSERT INTO race (name)
SELECT race_name
FROM tmp_x
ON CONFLICT ON some_constraint DO NOTHING
RETURNING id)
INSERT INTO race_result (race_id, runner_id, fp)
SELECT rc.id, ru.id, t.fp
FROM rac_ins rc, run_ins ru, tmp_x t;
DROP TABLE tmp_x;
The problem is that this joins the tables creating more results then there actually are. First, is this the best way to go about it? If it is, can I remedy this by explicitly JOINing the tables in some way? I think my problem is that I do not fully understand what happens when I use the WITH statement. Does it do all of the inserts first then creates a temp table/view that has all of the ids of the records that were inserted? If so, does it return an id twice if I tried to insert but there was a conflict on a constraint?
Thank you
Best Answer
Let's assume this is your tables (I've changed naming for clarity and ease of use):
And this is your data
I would insert the data in 3 steps. First, you
INSERT
in yourrunner
s table the ones you don't yet have. Then you do exactly the same forrace
s.NOTE: Using "NOT EXISTS..." works with versions of PostgreSQL older than 9.6, and is (AFAIK) pure standard SQL; so, it should work on other DB as well.
And now that your master tables are filled in, you
JOIN
your temporary table to them, and add your finish positions. [I assume the combinations are new data, and we do not need to check for already existing data.]At this point, you can query:
... and get
Check it all at SQLFiddle
NOTE: Your race_result table doesn't need a surrogate primary key. There's a natural key (race_id, runner_id) or (race_id, finish_position).