PostgreSQL – Upsert into Self-Referencing Table with Single Statement

postgresqlself-joinupsert

If I have a table A, like so:

A {
 id SERIAL
 title TEXT
 ...
 parentId INT references A.id via a foreign key constraint
}

I am pulling data from a source table – A_SOURCE – where there isn't a parentId column. Instead there is a parentTitle column. So the source table looks something like this:

A_SOURCE {
  title TEXT
  parentTitle TEXT
}

I started writing an upsert statement to insert into table A via a selection from table A_SOURCE before I realized that I can't easily resolve the parentTitle column in the source to a parentId in the target.

Since I can't be sure that the parent will have been inserted at the time the child being processed, a join or a subquery could return no results.

My upsert statement is looking something like this:

with source as (
  select
   title
   parentTitle
  from A_SOURCE 
)

insert into A
select
title
... I don't think I can resolve to parentId here?
from source
on concflict ...;

I know that I can run two separate statements:

  1. insert with null as parentId
  2. Then update the parentId fields in the second statement

But is it possible to do this in a single query?

Best Answer

Prepare the values to insert with a recursive CTE that pre-calculates the ids and defines an order. Then insert it in that order:

WITH RECURSIVE a_sort AS (
      /* get all entries without a parent (rank 1) */
      SELECT nextval('a_id_seq')::integer AS id,
             title,
             NULL::integer AS parentid,
             parenttitle,
             1 AS rank
      FROM a_source
      WHERE parenttitle IS NULL
   UNION ALL
      /* recursively get all the immediate children and increase rank */
      SELECT nextval('a_id_seq')::integer,
             src.title,
             a_sort.id,
             a_sort.title,
             a_sort.rank + 1
      FROM a_source AS src
         JOIN a_sort ON a_sort.title = src.parenttitle
)
INSERT INTO a
SELECT id,
       title,
       parentid
FROM a_sort
ORDER BY rank;

The beauty of the solution is that this uses the sequence behind a.id (a_id_seq in the example), so the sequence automatically has the correct value after we are done.

This solution assumes that the data in a_source are correct, i.e., do not contain cycles.