PostgreSQL – Upsert into Self-Referencing Table with Single Statement

postgresqlself-joinupsert

If I have a table A, like so:

A {
 id SERIAL
 title TEXT
 ...
 parentId INT references A.id via a foreign key constraint
}

I am pulling data from a source table – A_SOURCE – where there isn't a parentId column. Instead there is a parentTitle column. So the source table looks something like this:

A_SOURCE {
  title TEXT
  parentTitle TEXT
}

I started writing an upsert statement to insert into table A via a selection from table A_SOURCE before I realized that I can't easily resolve the parentTitle column in the source to a parentId in the target.

Since I can't be sure that the parent will have been inserted at the time the child being processed, a join or a subquery could return no results.

My upsert statement is looking something like this:

with source as (
  select
   title
   parentTitle
  from A_SOURCE 
)

insert into A
select
title
... I don't think I can resolve to parentId here?
from source
on concflict ...;

I know that I can run two separate statements:

insert with null as parentId
Then update the parentId fields in the second statement

But is it possible to do this in a single query?

Best Answer

Prepare the values to insert with a recursive CTE that pre-calculates the ids and defines an order. Then insert it in that order:

WITH RECURSIVE a_sort AS (
      /* get all entries without a parent (rank 1) */
      SELECT nextval('a_id_seq')::integer AS id,
             title,
             NULL::integer AS parentid,
             parenttitle,
             1 AS rank
      FROM a_source
      WHERE parenttitle IS NULL
   UNION ALL
      /* recursively get all the immediate children and increase rank */
      SELECT nextval('a_id_seq')::integer,
             src.title,
             a_sort.id,
             a_sort.title,
             a_sort.rank + 1
      FROM a_source AS src
         JOIN a_sort ON a_sort.title = src.parenttitle
)
INSERT INTO a
SELECT id,
       title,
       parentid
FROM a_sort
ORDER BY rank;

The beauty of the solution is that this uses the sequence behind a.id (a_id_seq in the example), so the sequence automatically has the correct value after we are done.

This solution assumes that the data in a_source are correct, i.e., do not contain cycles.

Related Solutions

MongoDB fast upsert

Since you are using mongoimport to do these upserts without --upsertFields, it will be using the _id index to do the upsert (see the note in the docs). That means that it will be scanning that index, and if you are including the _id field in the json dump that should be fine in terms of that particular search as long as that index is in memory. If you are not including the _id, then that will mean a full scan.

You can alter this behavior by using the aforementioned upsertFields option, though you will want to make sure those fields you pick are indexed (a compound index of the fields used rather than _id).

The overhead from the other indexes is actually probably just the overhead that occurs when you are updating several indexes as part of a data load.

Finally, if you are doing this repeatedly and intend to keep going, I would recommend creating your own tool to do this and having more control over the entire process. While mongoimport is a fine choice for simple tasks, more complex imports are better off handled by a more complex and customizable tool.

Postgresql – Altering a parent table in Postgresql 8.4 breaks child table defaults

Your problem is that when you add a new column to the_person, its child, the_person_two will have this field appended at the end of columns list (4th position), so after has_default column. See:

db=> \d temp_person
  Column   |       Type        |                            Modifiers                            
-----------+-------------------+-----------------------------------------------------------------
 person_id | integer           | not null default nextval('temp_person_person_id_seq'::regclass)
 name      | character varying | 
 foo       | text              | 

db=> \d temp_person_two 
   Column    |         Type         |                            Modifiers                            
-------------+----------------------+-----------------------------------------------------------------
 person_id   | integer              | not null default nextval('temp_person_person_id_seq'::regclass)
 name        | character varying    | 
 has_default | character varying(4) | not null default 'en'::character varying
 foo         | text                 |

So, when you execute this:

INSERT INTO temp_person_two VALUES ( NEW.* );

PostgreSQL will actually understand that you want to insert on the first three columns of temp_person_two (as NEW.* will expand to three values), generating something similar to this:

INSERT INTO temp_person_two(person_id,name,has_default)
VALUES ( NEW.person_id, NEW.name, NEW.foo );

So, temp_person_two.has_default will get the value of NEW.foo, which is NULL in your case.

The solution is to simply expand the column names:

INSERT INTO temp_person_two(person_id,name,foo)
VALUES ( NEW.person_id, NEW.name, NEW.foo );

or, you could also use this:

INSERT INTO temp_person_two(person_id,name,foo)
VALUES ( NEW.* );

But this is weak, as any changes on column positions may break your statements, so I'd recommend the first one.

EDIT:

So the conclusion and the lesson learned here is:

Always explicitly type the names of the columns and the values when issuing an INSERT command, in fact, when issuing any SQL command at all... =D

This will save you a lot of time solving problems like that in future.

Best Answer

Related Solutions

MongoDB fast upsert

Postgresql – Altering a parent table in Postgresql 8.4 breaks child table defaults

Related Question