PostgreSQL Recursive Descendant Depth – How to Implement

cte, postgresql, recursive, update

I need to calculate the depth of a descendant from its ancestor. When a record has object_id = parent_id = ancestor_id, it is considered a root node (the ancestor). I have been trying to get a WITH RECURSIVE query running with PostgreSQL 9.4.

I do not control the data or the columns. The data and table schema come from an external source. The table is growing continuously, currently by about 30k records per day. Any node in the tree can be missing; missing nodes will be pulled from the external source at some point. They are usually pulled in created_at DESC order, but the data is pulled by asynchronous background jobs.

We initially had a code solution to this problem, but now, with 5M+ rows, it takes almost 30 minutes to complete.

Example table definition and test data:

CREATE TABLE objects (
  id          serial NOT NULL PRIMARY KEY,
  customer_id integer NOT NULL,
  object_id   integer NOT NULL,
  parent_id   integer,
  ancestor_id integer,
  generation  integer NOT NULL DEFAULT 0
);

INSERT INTO objects(id, customer_id , object_id, parent_id, ancestor_id, generation)
VALUES (2, 1, 2, 1, 1, -1), --no parent yet
       (3, 2, 3, 3, 3, -1), --root node
       (4, 2, 4, 3, 3, -1), --depth 1
       (5, 2, 5, 4, 3, -1), --depth 2
       (6, 2, 6, 5, 3, -1), --depth 3
       (7, 1, 7, 7, 7, -1), --root node
       (8, 1, 8, 7, 7, -1), --depth 1
       (9, 1, 9, 8, 7, -1); --depth 2

Note that object_id is not unique, but the combination (customer_id, object_id) is unique.
I am running a query like this:

WITH RECURSIVE descendants(id, customer_id, object_id, parent_id, ancestor_id, depth) AS (
  SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
  FROM objects
  WHERE object_id = parent_id

  UNION

  SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
  FROM objects o
  INNER JOIN descendants d ON d.parent_id = o.object_id
  WHERE
    d.id <> o.id
  AND
    d.customer_id = o.customer_id
) SELECT * FROM descendants d;

I would like the generation column to be set as the depth that was calculated. When a new record is added, the generation column is set as -1. There are some cases where a parent_id may not have been pulled yet. If the parent_id does not exist, it should leave the generation column set to -1.

The final data should look like:

id | customer_id | object_id | parent_id | ancestor_id | generation
2    1             2           1           1            -1
3    2             3           3           3             0
4    2             4           3           3             1
5    2             5           4           3             2
6    2             6           5           3             3
7    1             7           7           7             0
8    1             8           7           7             1
9    1             9           8           7             2

The result of the query should be to update the generation column to the correct depth.

I started working from the answers to this related question on SO.

Best Answer

The query you have is basically correct. The only mistake is in the second (recursive) part of the CTE where you have:

INNER JOIN descendants d ON d.parent_id = o.object_id

It should be the other way around:

INNER JOIN descendants d ON d.object_id = o.parent_id

You want to join the objects with their parents (that have already been found).

So the query that calculates depth can be written (nothing else changed, only formatting):

-- calculate generation / depth, no updates
WITH RECURSIVE descendants
  (id, customer_id, object_id, parent_id, ancestor_id, depth)
 AS ( SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
      FROM objects
      WHERE object_id = parent_id

      UNION ALL

      SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
      FROM objects o
      INNER JOIN descendants d ON  d.customer_id = o.customer_id
                               AND d.object_id = o.parent_id  
      WHERE d.id <> o.id
    ) 
SELECT * 
FROM descendants d
ORDER BY id ;

For the update, you simply replace the last SELECT with an UPDATE, joining the result of the CTE back to the table:

-- update nodes
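WITH RECURSIVE descendants
  -- same CTE as above, except that ancestor_id and parent_id
  -- are omitted from the select lists
  (id, customer_id, object_id, depth)
 AS ( SELECT id, customer_id, object_id, 0
      FROM objects
      WHERE object_id = parent_id

      UNION ALL

      SELECT o.id, o.customer_id, o.object_id, d.depth + 1
      FROM objects o
      INNER JOIN descendants d ON  d.customer_id = o.customer_id
                               AND d.object_id = o.parent_id
      WHERE d.id <> o.id
    )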
UPDATE objects o 
SET generation = d.depth 
FROM descendants d
WHERE o.id = d.id 
  AND o.generation = -1 ;          -- skip unnecessary updates

Tested on SQLfiddle

Additional comments:

ancestor_id and parent_id do not need to be in the select lists (for ancestor_id this is obvious; for parent_id it is a bit trickier to see why), so you can keep them in the SELECT query if you want, but you can safely remove them from the UPDATE.
(customer_id, object_id) looks like a candidate for a UNIQUE constraint. If your data comply with this, add such a constraint. The joins performed in the recursive CTE would not make sense if the pair were not unique (a node could otherwise have 2 parents).
If you add that constraint, (customer_id, parent_id) would be a candidate for a FOREIGN KEY constraint that REFERENCES the (unique) (customer_id, object_id). You most probably do not want to add that FK constraint, though, since by your description you are adding new rows and some rows can reference others that have not been added yet.
There are certainly problems with the efficiency of the query if it is going to be run on a big table. Not on the first run, as almost the whole table will be updated anyway. But on the second run, you will want only new rows (and those that were not touched by the 1st run) to be considered for update. The CTE as it is will have to build a big result.
The AND o.generation = -1 in the final update makes sure that rows updated in the 1st run are not updated again, but the CTE is still the expensive part.
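
The UNIQUE constraint suggested above could be added roughly as follows. This is a minimal sketch: the constraint and index names are hypothetical, and the extra index on (customer_id, parent_id) is an assumption about what might help the child-by-parent join, not something the answer prescribes.

```sql
-- 1. Check for duplicates first; this should return no rows.
SELECT customer_id, object_id, count(*) AS copies
FROM objects
GROUP BY customer_id, object_id
HAVING count(*) > 1;

-- 2. Enforce uniqueness (the constraint name is hypothetical).
ALTER TABLE objects
  ADD CONSTRAINT objects_customer_object_key UNIQUE (customer_id, object_id);

-- 3. Optional and an assumption, not part of the answer:
--    an index for looking up children by their parent.
CREATE INDEX objects_customer_parent_idx
  ON objects (customer_id, parent_id);
```

The unique constraint also creates an index on (customer_id, object_id), which is the lookup the recursive join performs when it matches a child to its parent.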

The following is an attempt to address these issues: improve the CTE so that it considers as few rows as possible, and use (customer_id, object_id) instead of (id) to identify rows (so id is completely removed from the query). It can be used as the 1st update or as a subsequent one:

WITH RECURSIVE descendants 
  (customer_id, object_id, depth) 
 AS ( SELECT customer_id, object_id, 0
      FROM objects
      WHERE object_id = parent_id
        AND generation = -1

      UNION ALL

      SELECT o.customer_id, o.object_id, p.generation + 1
      FROM objects o
        JOIN objects p ON  p.customer_id = o.customer_id
                       AND p.object_id = o.parent_id
                       AND p.generation > -1
      WHERE o.generation = -1

      UNION ALL

      SELECT o.customer_id, o.object_id, d.depth + 1
      FROM objects o
      INNER JOIN descendants d ON  o.customer_id = d.customer_id
                               AND o.parent_id = d.object_id
      WHERE o.parent_id <> o.object_id
        AND o.generation = -1
    )
UPDATE objects o 
SET generation = d.depth 
FROM descendants d
WHERE o.customer_id = d.customer_id
  AND o.object_id = d.object_id
  AND o.generation = -1 ;      -- this is not really needed

Note how the CTE has 3 parts. The first two are the stable (non-recursive) parts. The 1st part finds the root nodes that have not been updated before and still have generation = -1, so they must be newly added nodes. The 2nd part finds children (with generation = -1) of parent nodes that have previously been updated.
The 3rd, recursive part finds all the descendants of the first two parts, as before.
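
To see the three parts at work, here is a small sketch of a second run. It assumes the sample data from the question and that the UPDATE has already been run once, so ids 3 to 9 have their generations and id 2 is still at -1. The two inserted rows are hypothetical late arrivals.

```sql
-- Hypothetical rows arriving after the first run:
INSERT INTO objects (id, customer_id, object_id, parent_id, ancestor_id, generation)
VALUES (1,  1, 1,  1, 1, -1),  -- the missing root above object 2
       (10, 1, 10, 9, 7, -1);  -- a new child of object 9 (generation 2)

-- Now re-run the three-part UPDATE shown above, unchanged. It should set:
--   object 1  -> 0   (1st part: new root with generation = -1)
--   object 10 -> 3   (2nd part: parent 9 already has generation 2)
--   object 2  -> 1   (3rd part: descendant of the new root 1)

-- Check the result for customer 1:
SELECT id, object_id, parent_id, generation
FROM objects
WHERE customer_id = 1
ORDER BY id;
```

Rows that already had a generation are never visited by any of the three parts, which is what keeps later runs cheap.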

Tested on SQLfiddle-2

Related Solutions

PostgreSQL tree structure and recursive CTE optimization

If you really have to modify these data rarely, then you can simply store the result of the CTE in a table, and run queries against this table. You can define indexes based on your typical queries.
Then TRUNCATE and repopulate (and ANALYZE) as necessary.
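
A minimal sketch of that approach, applied to the objects table from the question above rather than to the tables of the original related question. The object_depths table, its columns and its primary key are assumptions for illustration.

```sql
-- Hypothetical cache table holding the CTE result.
CREATE TABLE IF NOT EXISTS object_depths (
  customer_id integer NOT NULL,
  object_id   integer NOT NULL,
  depth       integer NOT NULL,
  PRIMARY KEY (customer_id, object_id)
);

-- Repopulate in one transaction so readers never see an empty table.
BEGIN;
TRUNCATE object_depths;

WITH RECURSIVE descendants (customer_id, object_id, depth) AS (
    -- root nodes
    SELECT customer_id, object_id, 0
    FROM objects
    WHERE object_id = parent_id

    UNION ALL

    -- children of nodes found so far
    SELECT o.customer_id, o.object_id, d.depth + 1
    FROM objects o
    JOIN descendants d ON  d.customer_id = o.customer_id
                       AND d.object_id   = o.parent_id
    WHERE o.parent_id <> o.object_id
)
INSERT INTO object_depths (customer_id, object_id, depth)
SELECT customer_id, object_id, depth
FROM descendants;
COMMIT;

-- Refresh planner statistics after the bulk load.
ANALYZE object_depths;
```

TRUNCATE takes an exclusive lock on the table, so concurrent readers wait until the COMMIT; that trade-off is only acceptable if repopulation is rare, as the answer assumes.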

On the other hand, if you can put the CTE in a separate stored procedure rather than a view, you can easily put your conditions in the CTE part rather than in the final SELECT (which is basically what you do when querying against tree_view_1), so that far fewer rows are involved in the recursion. From the query plan, it looks like PostgreSQL estimates row counts based on some far-from-true assumptions, probably producing suboptimal plans; this effect can be reduced somewhat with the SP solution.

EDIT: I may be missing something, but I just noticed that you don't filter the rows in the non-recursive term. You possibly want to include only root nodes there (WHERE parent_id IS NULL); I'd expect far fewer rows and recursion steps this way.

EDIT 2: As slowly became clear to me from the comments, I had misread the recursion in the original question as going the other way. What I mean here is starting from the root nodes and going deeper in the recursion.
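
As an illustration of pushing the condition into the CTE, here is a sketch of a set-returning SQL function. It is written against the objects table from the question above, where roots satisfy object_id = parent_id instead of parent_id IS NULL; the function name and parameter are hypothetical.

```sql
-- Hypothetical function: depths for a single customer only.
-- The filter is applied inside both terms of the CTE, so the recursion
-- starts from that customer's roots and never touches other customers' rows.
CREATE FUNCTION customer_depths(p_customer_id integer)
RETURNS TABLE (object_id integer, depth integer)
LANGUAGE sql STABLE AS
$$
  WITH RECURSIVE descendants (object_id, depth) AS (
      -- non-recursive term: only this customer's root nodes
      SELECT o.object_id, 0
      FROM objects o
      WHERE o.customer_id = p_customer_id
        AND o.object_id = o.parent_id

      UNION ALL

      -- recursive term: go one level deeper on each iteration
      SELECT o.object_id, d.depth + 1
      FROM objects o
      JOIN descendants d ON o.parent_id = d.object_id
      WHERE o.customer_id = p_customer_id
        AND o.parent_id <> o.object_id
  )
  SELECT d.object_id, d.depth
  FROM descendants d;
$$;

-- Usage with the sample data: objects 3, 4, 5, 6 at depths 0, 1, 2, 3.
SELECT * FROM customer_depths(2) ORDER BY depth;
```

With a view, the same customer_id filter would be applied only to the finished result, after the recursion had already walked every tree.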

PostgreSQL Update – Most Efficient Way to Perform Multiple Updates

If you don't want to interfere with other activity, running one UPDATE at a time in autocommit mode is very likely the best option. You should probably set synchronous_commit=off in that session (and only that session).

The indexes are going to slow you down, perhaps by a lot depending on your RAM and your IO system. But if the index is necessary for the other actions you don't want to interfere with, then there isn't anything you can do about it.

But since the fid is not yet correctly populated, the index on it is probably not actually useful to the concurrent processes you want to avoid interfering with, as they have not been changed yet to rely on that column being accurate. If that is the case, you can drop that index to gain speed, and build it in bulk later. The same probably applies to the foreign key constraint.

Once that index is gone, your updates can proceed via HOT (Heap Only Tuples) updates provided each block has enough free space. In that case, the updates will not have to do maintenance on the primary key index, either, saving that much more IO. To maximize the likelihood that this will work optimally, it is important that each UPDATE be its own transaction. That way one UPDATE can reuse space freed up by an earlier one.

Also, your WHERE clause should probably be like:

WHERE id=345 and fid is not null;

That way if the script gets interrupted, you can re-run it with minimal damage.

Since you seem to be running this on a test system already, an EXPLAIN (ANALYZE, BUFFERS) of some of the updates would be helpful, especially with track_io_timing set to on.
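
The session settings and the diagnostic run mentioned above might look like this. It is only a sketch: the demo_items table and the values are hypothetical stand-ins, since the real table is not shown here.

```sql
-- Hypothetical stand-in for the real table.
CREATE TEMP TABLE demo_items (id integer PRIMARY KEY, fid integer);
INSERT INTO demo_items (id, fid) VALUES (345, NULL);

-- Session-level only: other sessions keep their own setting.
SET synchronous_commit = off;

-- Adds I/O timings to EXPLAIN (ANALYZE, BUFFERS) output.
-- Changing it normally requires superuser privileges.
SET track_io_timing = on;

-- EXPLAIN ANALYZE really executes the UPDATE, so wrap it in a
-- transaction and roll back if the change should not persist.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE demo_items SET fid = 42 WHERE id = 345;
ROLLBACK;
```

For the real run, each UPDATE would be issued on its own outside an explicit transaction, so that autocommit gives every statement its own transaction as the answer recommends.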
