PostgreSQL – Help with Recursive CTE query with INSERT, ON CONFLICT and RETURNING

postgresql postgresql-9.5

I'm trying to write a query that inserts filenames into the following table, path:

CREATE SEQUENCE path_id_seq;
CREATE TABLE path (
    id INTEGER PRIMARY KEY DEFAULT NEXTVAL('path_id_seq'),
    parent_id INTEGER,
    name TEXT NOT NULL);

CREATE UNIQUE INDEX path_parent_id_name_ix
    ON path (COALESCE(parent_id, 0), name);

ALTER TABLE path
    ADD CONSTRAINT ios_path_parent_id_fk
    FOREIGN KEY (parent_id) REFERENCES path (id)
    ON DELETE CASCADE
    DEFERRABLE INITIALLY DEFERRED;

For example, given filenames Assets/images/a.png and Assets/images/b.png I'd like to insert the following rows.

id | parent_id |       name        
----+-----------+-------------------
  1 |         ¤ | Assets
  2 |         1 | images
  3 |         2 | a.png
  4 |         2 | b.png

Inserting the parent is easy:

INSERT INTO path (name)
VALUES ('Assets')
ON CONFLICT (COALESCE(parent_id, 0), name)
DO UPDATE SET name = EXCLUDED.name -- this strange update is so that we get rows back
RETURNING id, parent_id, name;

which yields:

 id | parent_id |  name  
----+-----------+--------
  1 |         ¤ | Assets

But I can't seem to put together a recursive CTE to use the parent's id as the parent_id in the next INSERT statement.

It will probably look something like the invalid query below.

WITH RECURSIVE names AS (
    SELECT name, lag(name, 1) OVER () as previous_name
    FROM (SELECT unnest(string_to_array('Assets/images/232.png', '/')) as name) names),

paths(id, name) AS (
    INSERT INTO path (name)
    SELECT name
    FROM names
    WHERE previous_name IS NULL
    ON CONFLICT (COALESCE(parent_id, 0), name)
    DO UPDATE SET name = EXCLUDED.name
    RETURNING id, name
    UNION
    INSERT INTO path (parent_id, name)
    SELECT paths.id, names.name
    FROM paths
    JOIN names ON names.previous_name = paths.name
    ON CONFLICT (COALESCE(parent_id, 0), name)
    DO UPDATE SET name = EXCLUDED.name
    RETURNING id, name)

SELECT *
FROM paths;

But I just can't seem to get it to work. If anyone sees a simpler way to do this it would be much appreciated.

Best Answer

You don't need a recursive CTE. You can call nextval() directly on the result of unnest() and then use lag() to get the id of the previous row. To deal with existing rows, I would join the result of unnest() to the path table, which distinguishes new rows from existing ones.

For the new rows, a new ID will be generated and only those will be inserted:

insert into path (id, name, parent_id)
select id, name, parent_id
from (
  select *,
         case 
           when is_new
             then lag(id) over (order by level) 
             else existing_parent
         end as parent_id
  from (
    SELECT case 
              when p.id is null then nextval('path_id_seq') 
              else p.id
           end as id, 
           x.name, x.level, p.parent_id as existing_parent, 
           p.id is null as is_new        
    FROM unnest(string_to_array('Assets/images/232.png', '/')) with ordinality as x(name, level)
      LEFT JOIN path p on p.name = x.name  -- matches on name only; the parent is not checked
    order by level
  ) t1
) t2
where is_new;

Live example: http://rextester.com/YGMT19565

The only "drawback", if you will, is that the sequence values are not necessarily assigned in the order you'd expect. But since the only purpose of a generated primary key is to be unique (and nothing else), this shouldn't be a problem.
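
To make the logic of the query easier to follow, here is a minimal Python sketch of what it computes. It is not the query itself. The names plan_new_rows, existing and id_source are hypothetical: existing stands in for the path table keyed by name (the query joins on name only, and the sketch assumes each name occurs at most once), and id_source stands in for nextval('path_id_seq').

```python
from itertools import count


def plan_new_rows(filename, existing, id_source):
    """Return the (id, parent_id, name) rows that would be inserted.

    existing  -- dict mapping name -> id; a stand-in for the LEFT JOIN on
                 path.name (assumption: each name occurs at most once)
    id_source -- iterator of fresh ids; a stand-in for nextval('path_id_seq')
    """
    new_rows = []
    previous_id = None  # plays the role of lag(id) over (order by level)
    for name in filename.split("/"):  # string_to_array + unnest
        if name in existing:
            row_id = existing[name]  # existing row: keep its id, insert nothing
        else:
            row_id = next(id_source)  # new row: draw a fresh id
            new_rows.append((row_id, previous_id, name))
        previous_id = row_id
    # Only after the whole "statement" do the new rows become visible.
    for row_id, _, name in new_rows:
        existing[name] = row_id
    return new_rows


ids = count(1)
known = {}
first = plan_new_rows("Assets/images/a.png", known, ids)
second = plan_new_rows("Assets/images/b.png", known, ids)
```

Under these assumptions, the two calls together produce the four rows from the example in the question: three new rows for the first filename and only b.png for the second.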

Related Solutions

PostgreSQL tree structure and recursive CTE optimization

If you really only need to modify this data rarely, you can simply store the result of the CTE in a table and run queries against that table. You can define indexes based on your typical queries.
Then TRUNCATE and repopulate (and ANALYZE) as necessary.

On the other hand, if you can put the CTE in separate stored procedures rather than a view, you can easily put your conditions in the CTE part rather than in the final SELECT (which is basically what you do when querying against tree_view_1), so that far fewer rows are involved in the recursion. From the query plan it looks like PostgreSQL estimates row counts based on some far-from-true assumptions, probably producing suboptimal plans. This effect can be reduced somewhat with the stored procedure solution.

EDIT: I may be missing something, but I just noticed that you don't filter the rows in the non-recursive term. Possibly you want to include only root nodes there (WHERE parent_id IS NULL). I'd expect far fewer rows and recursion steps this way.

EDIT 2: As slowly became clear to me from the comments, I misread the recursion in the original question as going the other way. Here I mean starting from the root nodes and going deeper in the recursion.
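
As a rough illustration of filtering in the non-recursive term, here is a small sketch run through Python's sqlite3 module as a stand-in for PostgreSQL. The table node, its columns, and the helper names are hypothetical, since the original question's schema is not shown here. The point is only that the root filter sits inside the CTE, so the recursion never visits other trees.

```python
import sqlite3

# Assumption: a simple adjacency-list table named "node"; not the asker's schema.
SUBTREE_SQL = """
WITH RECURSIVE tree(id, parent_id, name, depth) AS (
    SELECT id, parent_id, name, 0
    FROM   node
    WHERE  parent_id IS NULL      -- root nodes only ...
    AND    name = ?               -- ... and only the root we care about
    UNION ALL
    SELECT n.id, n.parent_id, n.name, t.depth + 1
    FROM   node n
    JOIN   tree t ON n.parent_id = t.id
)
SELECT id, name, depth FROM tree ORDER BY depth, id
"""


def subtree(conn, root_name):
    """Return (id, name, depth) for one root and all of its descendants."""
    return conn.execute(SUBTREE_SQL, (root_name,)).fetchall()


def demo_connection():
    """Build a tiny in-memory forest with two roots, A and B."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?)",
        [(1, None, "A"), (2, None, "B"), (3, 1, "a1"), (4, 3, "a2"), (5, 2, "b1")],
    )
    return conn
```

Calling subtree(demo_connection(), "A") walks only the tree under A. Moving the name condition to the final SELECT would return the same rows but recurse through every tree first.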

PostgreSQL – Recursive CTE to find unique slug

First off, you do not want to use char(50). Use varchar(50) or just text. Read more:

Any downsides of using data type “text” for storing strings?

Assuming the following rules:

Basic slugs never end with a dash.
Duplicate slugs are suffixed with a dash and a sequential number (-123).

Note that all of the following methods are subject to race conditions: concurrent operations might identify the same "free" name for the next slug.
To defend against it, you can impose a UNIQUE constraint on slug and be prepared to repeat an INSERT upon a duplicate key violation, or take out a write lock on the table at the start of the transaction.
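
The "repeat the INSERT upon duplicate key violation" defense could look roughly like the following sketch. It uses Python's sqlite3 module as a stand-in for a PostgreSQL driver, and the function name, the max_attempts parameter, and the simplified article table are assumptions for illustration. For brevity it probes suffixes one by one instead of running one of the slug queries between attempts.

```python
import sqlite3

# Assumption: a simplified article table with a UNIQUE constraint on slug.
SCHEMA = """
CREATE TABLE article (
    article_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    slug       TEXT NOT NULL UNIQUE
)
"""


def insert_with_free_slug(conn, title, base_slug, max_attempts=100):
    """Insert an article, retrying with the next suffix on a unique violation."""
    candidate = base_slug
    for attempt in range(1, max_attempts + 1):
        try:
            with conn:  # commits on success, rolls back on an exception
                conn.execute(
                    "INSERT INTO article (title, slug) VALUES (?, ?)",
                    (title, candidate),
                )
            return candidate
        except sqlite3.IntegrityError:
            # Someone else holds this slug: try base-1, base-2, ...
            candidate = f"{base_slug}-{attempt}"
    raise RuntimeError(f"no free slug for {base_slug!r} after {max_attempts} attempts")
```

The UNIQUE constraint is what makes this safe: whichever concurrent writer loses the race gets an error instead of a silent duplicate, and simply tries again.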

If you glue the suffix to the basic slug name with a dash and allow basic slugs to end in separate numbers, the specification is a tiny bit ambiguous (see comments). I suggest a unique delimiter of your choice instead (which is otherwise disallowed).

Efficient rCTE

WITH RECURSIVE
  input AS (SELECT 'news-on-apple'::text AS slug)  -- input basic slug here once
, cte   AS (
   SELECT slug || '-' AS slug  -- append '-' once, if basic slug exists
        , 1 as suffix          -- start with suffix 1
   FROM   article
   JOIN   input USING (slug)

   UNION ALL
   SELECT c.slug, c.suffix + 1  -- increment by 1 ...
   FROM   cte     c
   JOIN   article a ON a.slug = c.slug || c.suffix  -- ... if slug-n already exists
   )
(
SELECT slug || suffix AS slug
FROM   cte
ORDER  BY suffix DESC  -- pick the last (free) one
LIMIT  1
)  -- parentheses required
UNION  ALL  -- if the basic slug wasn't taken, fall back to that
SELECT slug FROM input
LIMIT  1;
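
For readers who find the recursion hard to trace, this is a plain Python sketch of what the query above computes. The function name and the taken set (standing in for the slug column of article) are hypothetical.

```python
def next_free_slug(base, taken):
    """Mirror the rCTE: return base if free, else the first free base-n."""
    if base not in taken:
        return base  # the fallback branch: SELECT slug FROM input
    suffix = 1  # non-recursive term: start with suffix 1
    while f"{base}-{suffix}" in taken:  # recursive term: slug-n already exists
        suffix += 1
    return f"{base}-{suffix}"  # the last row of the CTE is the free one
```

Note that the walk stops at the first missing number, so a gap left by a deleted slug gets reused.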

Better performance without rCTE

If you worry about thousands of slugs competing for the same slug or generally want to optimize performance, I'd consider a different, faster approach.

WITH input AS (SELECT 'news-on-apple'::text  AS slug
                    , 'news-on-apple-'::text AS slug1)  -- input basic slug here
SELECT i.slug
FROM   input        i
LEFT   JOIN article a USING (slug)
WHERE  a.slug IS NULL  -- doesn't exist yet.

UNION ALL
(  -- parentheses required
SELECT i.slug1 || COALESCE(right(a.slug, length(i.slug1) * -1)::int + 1, 1)
FROM   input        i
LEFT   JOIN article a ON a.slug LIKE (i.slug1 || '%')  -- match up to last "-"
                     AND right(a.slug, length(i.slug1) * -1) ~ '^\d+$' -- suffix numbers only
ORDER  BY right(a.slug, length(i.slug1) * -1)::int DESC
)
LIMIT  1;

If the basic slug isn't taken yet, the more expensive second SELECT is never executed - same as above, but much more important here. Check with EXPLAIN ANALYZE: Postgres is smart that way with LIMIT queries. Related:
- Optimize a query on two big tables

Check for the leading string and the suffix separately, so the LIKE expression can use a basic btree index with text_pattern_ops like:
```
CREATE INDEX article_slug_idx ON article (slug text_pattern_ops);
```
Detailed explanation:
- Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Convert the suffix to integer before you apply max() or sort by it. Numbers in text representation compare as strings, so '9' would rank above '10'.
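
The text-versus-integer pitfall is easy to reproduce outside the database. This tiny Python sketch uses made-up suffix values, and Python's string comparison stands in for text ordering in Postgres:

```python
suffixes = ["1", "2", "9", "10"]  # suffixes as they appear at the end of slugs

max_as_text = max(suffixes)                  # character-wise comparison
max_as_int = max(int(s) for s in suffixes)   # numeric comparison
```

Compared as text, "9" wins because its first character sorts after "1". Only the integer version finds 10, which is why the query casts with ::int before ordering.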

Optimize performance

To get the optimum, consider storing the suffix separately from the basic slug and concatenating the slug as needed: concat_ws('-' , slug, suffix::text) AS slug

CREATE TABLE article (
   article_id serial PRIMARY KEY
 , title text NOT NULL
 , slug  text NOT NULL
 , suffix int
);

The query for a new slug then becomes:

SELECT slug
    || COALESCE((
          SELECT '-'::text || (max(suffix) + 1)::text
          FROM   article a
          WHERE  a.slug = i.slug), '') AS slug
FROM  (SELECT 'news-on-apple'::text AS slug) i;  -- input basic slug here

Ideally supported with a unique index on (slug, suffix).

Query for list of slugs

In any version of Postgres you can provide rows in a VALUES expression.

SELECT *
FROM   article
JOIN  (
   VALUES
     ('slug-foo'::text, 1)
   , ('slug-bar', 7)
   ) u(slug,suffix) USING (slug,suffix);

You can also use IN with a set of row-type expressions, which is shorter:

SELECT *
FROM   article
WHERE (slug,suffix) IN (('slug-foo', 1), ('slug-bar',7));

Details under this related question (as commented below):

the <set clause>'s <multiple column assignment>

For long lists, the JOIN to a VALUES expression is typically faster.

In Postgres 9.4 or later you can also use the variant of unnest() that unnests multiple arrays in parallel.

Given an array of basic slugs and a corresponding array of suffixes (as per comment):

SELECT *
FROM   article
JOIN   unnest('{slug-foo,slug-bar}'::text[]
            , '{1,7}'::int[]) AS u(slug,suffix) USING (slug,suffix);
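
As a mental model for the parallel unnest() and the join on (slug, suffix), here is a small Python sketch. The function names and the list-of-dicts representation of article are hypothetical. zip_longest mirrors how unnest() pads shorter arrays with NULL.

```python
from itertools import zip_longest


def pair_up(slugs, suffixes):
    """Mimic unnest(slugs, suffixes): one row per position, padded with None."""
    return list(zip_longest(slugs, suffixes))


def select_articles(articles, wanted_pairs):
    """Mimic JOIN ... USING (slug, suffix) against the given pairs."""
    # In SQL, NULL never equals anything, so pairs containing None match nothing.
    wanted = {pair for pair in wanted_pairs if None not in pair}
    return [a for a in articles if (a["slug"], a["suffix"]) in wanted]
```

Each position in the two arrays forms one (slug, suffix) pair, and an article is returned only if both of its values match the same pair.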