Postgresql – Optimizing one-way syncing of large (wide) tree

optimizationpostgresqlpostgresql-11tree

I deal with several trees that can vary in size (they all have depths of ~8-10) but it's not unusual for them to consist of ~50-100k nodes.

At the moment, the trees are stored using simple parent_id FKs.

A tree is created/generated in a program, and then it needs to be saved in the database. The catch is that we cannot just wipe the previous tree (of the same type, if one exists), we need to update it to reflect the new (changed) tree. Occasionally this means swapping the entire tree (if the root changed), but a lot of the time it just means minor changes here and there (e.g. add some children/sub-trees, retire others). We also do not delete old children (and their children), they just get marked as old.

The way updates/verifications are done is by traversing through the tree (depth-first) and checking each node one-by-one.

One of the performance issues is that it can take almost 2 minutes to just verify that a tree (~50k nodes) in the db is the same as the one in the program (since it involves n selects where n is the number of total nodes).

While utilizing an ltree structure could be beneficial in terms of reads, I don't think it would solve this issue (inserting/updating/verifying a tree).

I'm not sure what the best way to optimize this is. One idea I had was to calculate and store a hash of every sub-tree, and that way it could potentially speed things up in most cases since it would know what parts of the tree that doesn't need touching. The hash calculation should be pretty fast and can be parallelized in the program, where the workload is easier to manage. However, perhaps there is some other way that I'm missing?

Bulk inserts etc. are impossible since the tree relies on recursive relationships.

Furthermore, the database is naturally in the cloud, which means that these thousands of queries being executed one-by-one probably incur quite a bit of time just in terms of latency (both the program and the db are on AWS so the latency is small, but still exists of course). A part of me wants to bulk insert the newly generated tree into some temporary table and then write a DB function that syncs it into the actual table.

General structure/schema of a tree:

Each node has an id and a parent_id column (nullable FK to id)
The root node has parent_id=NULL
What makes each node unique is (parent_id, attrs…)
- Meaning that duplicate nodes can exist in multiple places in the tree
Nodes are only added or marked as obsolete

Any advice is much appreciated!

Best Answer

It's look like that is more efficient to dump the new state of the tree into the base along with the old state and then compare them with a single CTE.

Here is an approach I've used in the MariaDB but I hope PostgreSQL allows the same with minor changes:

WITH RECURSIVE 
   old_tree (dpth, id, parent_id, attr1, attr2 ...) AS (
      SELECT 1, t.*, a.*                          -- initial portion
        FROM trees  AS t                          -- the table containing all the trees
        JOIN attrs  AS a ON a.node_id = t.id      -- attributes for node
       WHERE t.id = 1234                          -- root node for the old tree
      UNION ALL
      SELECT o.dpth+1, t.*, a.*                   -- recurrent portion
        FROM old_tree AS o                        -- this is recursive CTE
        JOIN trees    AS t ON t.parent_id = o.id  -- childs of the nodes already fetched 
        JOIN attrs    AS a ON a.node_id = t.id    -- attributes for node(s)
      ), 

   new_tree (dpth, id, parent_id, attr1, attr2 ...) AS (
      SELECT 1, t.*, a.*                          -- all the same for the new_tree
        FROM trees  AS t                          
        JOIN attrs  AS a ON a.node_id = t.id      
       WHERE t.id = 4567                          -- root node for the new tree
      UNION ALL
      SELECT n.dpth+1, t.*, a.*                                    
        FROM new_tree AS n                        -- this is recursive CTE too
        JOIN trees    AS t ON t.parent_id = n.id  
        JOIN attrs    AS a ON a.node_id = t.id    
      )
SELECT ot.*, nt.*                     -- should be aliased to avoid ambiguity
  FROM            old_tree AS ot 
  /* there is no FOJ in mysql/mariadb but I've used it to be more clear */
  FULL OUTER JOIN new_tree AS nt  ON nt.attr1 = ot.attr1 -- or any other node identifier(s)
                                 AND nt.dpth = ot.dpth   -- and by same depth of recursion
;

As a result you'll get the table like that:

+------+------+------+------+------+------+------+------+
|  oid | opid | oat1 | oat2 |  nid | npid | nat1 | nat2 |
+------+------+------+------+------+------+------+------+
| 1234 | NULL |    X |    x | 4567 | NULL |    X |    x | -- unchanged
| 1235 | 1234 |    Y |    y | NULL | NULL | NULL | NULL | -- deleted
| NULL | NULL | NULL | NULL | 4568 | 4567 |    A |    B | -- added
| 1357 | 1234 |    Q |    a | 9876 | 4567 |    Q |    z | -- modified
| .... | .... | .... | .... | .... | .... | .... | .... |

The final SELECT of the WITH can be used to filter out only changed/added/deleted nodes. Then you can modify the old tree structure or log the changes or else.

Related Solutions

PostgreSQL tree structure and recursive CTE optimization

If you really have to modify these data rarely, then you can simply store the result of the CTE in a table, and run queries against this table. You can define indexes based on your typical queries.
Then TRUNCATE and repopulate (and ANALYZE) as necessary.

On the other hand, if you can put the CTE in separate stored procedures rather than a view, you can easily put your conditions in the CTE part rather then the final SELECT (which is basically what you do querying against tree_view_1), so that much less rows will be involved in the recursion. From the query plan it looks like that PostgreSQL estimates row numbers based on some far-from-true assumptions, probably producing suboptimal plans - this effect can be reduced somewhat with the SP solution.

EDIT I may miss something, but just noticed that in the non-recursive term you don't filter the rows. Possibly you want to include only root nodes there (WHERE parent_id IS NULL) - I'd expect much less rows and recursions this way.

EDIT 2 AS it slowly became clear for me from the comments, I misthought the recursion in the original question going the other way. Here I mean starting from the root nodes and going deeper in the recursion.

Postgresql – How to sort the results of a recursive query in an expanded tree-like fashion

An array representing the path from the root up to the leaf should achieve the desired sort order:

WITH RECURSIVE node_rec AS (
   (SELECT 1 AS depth, ARRAY[node] AS path, *
    FROM   nodes
    WHERE  parent IS NULL
    LIMIT  10
   )    
    UNION ALL
    SELECT r.depth + 1, r.path || n.node, n.*
    FROM   node_rec r 
    JOIN   nodes    n ON n.parent = r.node
    WHERE  r.depth < 4
)
SELECT *
FROM   node_rec
ORDER  BY path;

If two or more child nodes have the same parent node, I want them to be sorted by their timestamp.

Shift the path by one towards the root and order by that column additionally:

WITH RECURSIVE node_rec AS (
   (SELECT 1 AS depth, ARRAY[node] AS path, *
    FROM   nodes
    WHERE  parent IS NULL
    LIMIT  10
   )    
    UNION ALL
    SELECT r.depth + 1, r.path || n.parent, n.*
    FROM   node_rec r 
    JOIN   nodes    n ON n.parent = r.node
    WHERE  r.depth < 4
)
SELECT *
FROM   node_rec
ORDER  BY path, ts;

Best Answer

Related Solutions

PostgreSQL tree structure and recursive CTE optimization

Postgresql – How to sort the results of a recursive query in an expanded tree-like fashion

Related Question