Postgresql – Bulk insert M:N relation in PostgreSQL

migrationpostgresql

I need to import data from an old database to a new one, with slightly different structure. For example, in the old database, there is a table recording employees and their supervisors:

CREATE TABLE employee (ident TEXT PRIMARY KEY, name TEXT, supervisor_name TEXT)

Now, the new database is as follows:

CREATE TABLE person (id BIGSERIAL PRIMARY KEY, name TEXT, old_ident TEXT);
CREATE TABLE team (id BIGSERIAL PRIMARY KEY);
CREATE TABLE teammember (person_id BIGINT, team_id BIGINT, role CHAR(1));

That is, instead of plain table of employees with their supervisors' names, the new (more generic) database enables to create teams of people. The employees are members with role 'e', supervisors with role 's'.

The question is how to easily migrate the data from employee to the new structure, one team per employee-supervisor pair. For example, employees

employee: ('abc01', 'John', 'Dave'), ('abc02', 'Kyle', 'Emily')

are to be migrated as

person: (1, 'John', 'abc01'), (2, 'Dave', NULL), (3, 'Kyle', 'abc02'), (4, 'Emily', NULL)
team: (1), (2)
teammember: (1, 1, 'e'), (2, 1, 's'), (3, 2, 'e'), (4, 2, 's')

I would consider using a data-modifying CTE, inserting the employees and supervisors first, then teams among them. However, CTE may only return data from the inserted table row. Thus, I am not able to match who was the supervisor of who.

The only solution I can see is using plpgsql, which would simply iterate over the data, hold the inserted team IDs in a temporary variable, and then insert the appropriate teammember rows. But I am curious whether there are simpler or more elegant solutions.

There will be roughly several hundreds to several thousands of employees. Although it is generally a good practice, in my case, I wouldn't like to generate the new IDs based on the old ones, as the old IDs are strings like *.GM2. I store them to the old_ident column for reference.

Best Answer

You have all the information you need to populate the new database from the old one with 4 insert statements:

create table team_ids (id serial, name TEXT)

insert into team_ids (name)
select distinct supervisor_name from employee

-- now supervisors have ids assigned by "serial" type

insert into person (id, name, old_ident)
select ident, name, ident from employee
union
select ident, supervisor_name, ident from employee

insert into team (id) -- meh
select id from team_ids

insert into teammember (person_id, team_id, role)
select e.ident, t.id, 'e')
from employee as e, join team_ids as t
on t.name = e.supervisor_name
union -- and, I guess
select t.id, t.id, 'm')
from team_ids as t

You may have to adjust to taste. I'm assuming employee.ident can be mapped onto person.id, and that your DBMS allows assigning values to columns with auto-generated values. Except for that, it's just basic SQL, nothing fancy and, of course, no loops.

Additional commentary:

The 'team' table might be (more conventionally) renamed to department.
A SERIAL (with its 2 billion possibilities) should be plenty, no need for a BIGSERIAL.
There appears to be no database mechanism to enforce 1:1 cardinality of manager to team. Doesn't every team need a leader, by definition? Is there not a CHECK or FOREIGN KEY constraint for teammember.role? Perhaps the question simplified these details away.
The "teammember" table name would more conventionally have a word boundary, for example TeamMember or team_member.

Best Answer

Related Solutions

How to Determine if a Column is Defined as Serial Data Type in PostgreSQL

PostgreSQL: How does the implicit index work when using “COPY” for loading bulk rows into a newly created table

Related Question