Postgresql – Optimizing one-way syncing of large (wide) tree

optimizationpostgresqlpostgresql-11tree

I deal with several trees that can vary in size (they all have depths of ~8-10) but it's not unusual for them to consist of ~50-100k nodes.

At the moment, the trees are stored using simple parent_id FKs.

A tree is created/generated in a program, and then it needs to be saved in the database. The catch is that we cannot just wipe the previous tree (of the same type, if one exists), we need to update it to reflect the new (changed) tree. Occasionally this means swapping the entire tree (if the root changed), but a lot of the time it just means minor changes here and there (e.g. add some children/sub-trees, retire others). We also do not delete old children (and their children), they just get marked as old.

The way updates/verifications are done is by traversing through the tree (depth-first) and checking each node one-by-one.

One of the performance issues is that it can take almost 2 minutes to just verify that a tree (~50k nodes) in the db is the same as the one in the program (since it involves n selects where n is the number of total nodes).

While utilizing an ltree structure could be beneficial in terms of reads, I don't think it would solve this issue (inserting/updating/verifying a tree).

I'm not sure what the best way to optimize this is. One idea I had was to calculate and store a hash of every sub-tree, and that way it could potentially speed things up in most cases since it would know what parts of the tree that doesn't need touching. The hash calculation should be pretty fast and can be parallelized in the program, where the workload is easier to manage. However, perhaps there is some other way that I'm missing?

Bulk inserts etc. are impossible since the tree relies on recursive relationships.

Furthermore, the database is naturally in the cloud, which means that these thousands of queries being executed one-by-one probably incur quite a bit of time just in terms of latency (both the program and the db are on AWS so the latency is small, but still exists of course). A part of me wants to bulk insert the newly generated tree into some temporary table and then write a DB function that syncs it into the actual table.

General structure/schema of a tree:

  • Each node has an id and a parent_id column (nullable FK to id)
  • The root node has parent_id=NULL
  • What makes each node unique is (parent_id, attrs…)
    • Meaning that duplicate nodes can exist in multiple places in the tree
  • Nodes are only added or marked as obsolete

Any advice is much appreciated!

Best Answer

It's look like that is more efficient to dump the new state of the tree into the base along with the old state and then compare them with a single CTE.

Here is an approach I've used in the MariaDB but I hope PostgreSQL allows the same with minor changes:

WITH RECURSIVE 
   old_tree (dpth, id, parent_id, attr1, attr2 ...) AS (
      SELECT 1, t.*, a.*                          -- initial portion
        FROM trees  AS t                          -- the table containing all the trees
        JOIN attrs  AS a ON a.node_id = t.id      -- attributes for node
       WHERE t.id = 1234                          -- root node for the old tree
      UNION ALL
      SELECT o.dpth+1, t.*, a.*                   -- recurrent portion
        FROM old_tree AS o                        -- this is recursive CTE
        JOIN trees    AS t ON t.parent_id = o.id  -- childs of the nodes already fetched 
        JOIN attrs    AS a ON a.node_id = t.id    -- attributes for node(s)
      ), 

   new_tree (dpth, id, parent_id, attr1, attr2 ...) AS (
      SELECT 1, t.*, a.*                          -- all the same for the new_tree
        FROM trees  AS t                          
        JOIN attrs  AS a ON a.node_id = t.id      
       WHERE t.id = 4567                          -- root node for the new tree
      UNION ALL
      SELECT n.dpth+1, t.*, a.*                                    
        FROM new_tree AS n                        -- this is recursive CTE too
        JOIN trees    AS t ON t.parent_id = n.id  
        JOIN attrs    AS a ON a.node_id = t.id    
      )
SELECT ot.*, nt.*                     -- should be aliased to avoid ambiguity
  FROM            old_tree AS ot 
  /* there is no FOJ in mysql/mariadb but I've used it to be more clear */
  FULL OUTER JOIN new_tree AS nt  ON nt.attr1 = ot.attr1 -- or any other node identifier(s)
                                 AND nt.dpth = ot.dpth   -- and by same depth of recursion
;

As a result you'll get the table like that:

+------+------+------+------+------+------+------+------+
|  oid | opid | oat1 | oat2 |  nid | npid | nat1 | nat2 |
+------+------+------+------+------+------+------+------+
| 1234 | NULL |    X |    x | 4567 | NULL |    X |    x | -- unchanged
| 1235 | 1234 |    Y |    y | NULL | NULL | NULL | NULL | -- deleted
| NULL | NULL | NULL | NULL | 4568 | 4567 |    A |    B | -- added
| 1357 | 1234 |    Q |    a | 9876 | 4567 |    Q |    z | -- modified
| .... | .... | .... | .... | .... | .... | .... | .... |

The final SELECT of the WITH can be used to filter out only changed/added/deleted nodes. Then you can modify the old tree structure or log the changes or else.