PostgreSQL tree structure and recursive CTE optimization

cteoptimizationperformancepostgresqltree

I'm trying to represent a tree structure in PostgreSQL (8.4) to be able to query the path from the root to a given node or to find all the nodes within a sub-branch.

Here is a test table:

CREATE TABLE tree_data_1 (
    forest_id TEXT NOT NULL,
    node_id TEXT NOT NULL,
    parent_id TEXT,
    node_type TEXT,
    description TEXT,
    PRIMARY KEY (forest_id, node_id),
    FOREIGN KEY (forest_id, parent_id) REFERENCES tree_data_1 (forest_id, node_id)
);
CREATE INDEX tree_data_1_forestid_parent_idx ON tree_data_1(forest_id, parent_id);
CREATE INDEX tree_data_1_forestid_idx ON tree_data_1(forest_id);
CREATE INDEX tree_data_1_nodeid_idx ON tree_data_1(node_id);
CREATE INDEX tree_data_1_parent_idx ON tree_data_1(parent_id);

Each node is identified by (forest_id, node_id) (there can be another node with the same name in another forest). Each tree starts at a root node (where parent_id is null), although I'm only expecting one per forest.

Here is the view that uses a recursive CTE:

CREATE OR REPLACE VIEW tree_view_1 AS
    WITH RECURSIVE rec_sub_tree(forest_id, node_id, parent_id, depth, path, cycle) AS (
        SELECT td.forest_id, td.node_id, td.parent_id, 0, ARRAY[td.node_id], FALSE FROM tree_data_1 td
        UNION ALL
        SELECT td.forest_id, rec.node_id, td.parent_id, rec.depth+1, td.node_id || rec.path, td.node_id = ANY(rec.path)
            FROM tree_data_1 td, rec_sub_tree rec
            WHERE td.forest_id = rec.forest_id AND rec.parent_id = td.node_id AND NOT cycle
     )
     SELECT forest_id, node_id, parent_id, depth, path
         FROM rec_sub_tree;

This is a slightly modified version of the example in the documentation, to take into account the forest_id, and that returns rec.node_id in the recursive SELECT instead of what would be td.node_id.

The get the path from the root to a given node, this query can be used:

SELECT * FROM tree_view_1 WHERE forest_id='Forest A' AND node_id='...' AND parent_id IS NULL

The get a sub-tree, this query can be used:

SELECT * FROM tree_view_1 WHERE forest_id='Forest A' AND parent_id='...'

The get a full tree within a given forest:

SELECT * FROM tree_view_1 WHERE forest_id='Forest A' AND parent_id IS NULL

The last query uses the following query plan (viewable on explain.depesz.com):

 CTE Scan on rec_sub_tree  (cost=1465505.41..1472461.19 rows=8 width=132) (actual time=0.067..62480.876 rows=133495 loops=1)
   Filter: ((parent_id IS NULL) AND (forest_id = 'Forest A'::text))
   CTE rec_sub_tree
     ->  Recursive Union  (cost=0.00..1465505.41 rows=309146 width=150) (actual time=0.048..53736.585 rows=1645992 loops=1)
           ->  Seq Scan on tree_data_1 td  (cost=0.00..6006.16 rows=247316 width=82) (actual time=0.034..975.796 rows=247316 loops=1)
           ->  Hash Join  (cost=13097.90..145331.63 rows=6183 width=150) (actual time=2087.065..5842.870 rows=199811 loops=7)
                 Hash Cond: ((rec.forest_id = td.forest_id) AND (rec.parent_id = td.node_id))
                 ->  WorkTable Scan on rec_sub_tree rec  (cost=0.00..49463.20 rows=1236580 width=132) (actual time=0.017..915.814 rows=235142 loops=7)
                       Filter: (NOT cycle)
                 ->  Hash  (cost=6006.16..6006.16 rows=247316 width=82) (actual time=1871.964..1871.964 rows=247316 loops=7)
                       ->  Seq Scan on tree_data_1 td  (cost=0.00..6006.16 rows=247316 width=82) (actual time=0.017..872.725 rows=247316 loops=7)
 Total runtime: 62978.883 ms
(12 rows)

As expected, this isn't very efficient. I'm partly surprised it doesn't seem to make use of any index.

Considering that this data would be read often but rarely modified (perhaps a small modification every couple of weeks), what possible techniques are there to optimise such queries and/or data representation?

EDIT: I would also like to retrieve the tree in depth-first order. Using ORDER BY path also degrades substantially the speed of the query above.


Sample Python program to populate the table with test data (requires Psycopg2), probably a bit more than I expect to have in a more realistic situation:

from uuid import uuid4
import random
import psycopg2

random.seed(1234567890)
min_depth = 3
max_depth = 6
max_sub_width = 10
next_level_prob = 0.7

db_connection = psycopg2.connect(database='...')
cursor = db_connection.cursor()
query = "INSERT INTO tree_data_1(forest_id, node_id, parent_id) VALUES (%s, %s, %s)"

def generate_sub_tree(forest_id, parent_id=None, depth=0, node_ids=[]):
    if not node_ids:
        node_ids = [ str(uuid4()) for _ in range(random.randint(1, max_sub_width)) ]
    for node_id in node_ids:
        cursor.execute(query, [ forest_id, node_id, parent_id ])
        if depth < min_depth or (depth < max_depth and random.random() < next_level_prob):
            generate_sub_tree(forest_id, node_id, depth+1)

generate_sub_tree('Forest A', node_ids=['Node %d' % (i,) for i in range(10)])
generate_sub_tree('Forest B', node_ids=['Node %d' % (i,) for i in range(10)])

db_connection.commit()
db_connection.close()

Best Answer

If you really have to modify these data rarely, then you can simply store the result of the CTE in a table, and run queries against this table. You can define indexes based on your typical queries.
Then TRUNCATE and repopulate (and ANALYZE) as necessary.

On the other hand, if you can put the CTE in separate stored procedures rather than a view, you can easily put your conditions in the CTE part rather then the final SELECT (which is basically what you do querying against tree_view_1), so that much less rows will be involved in the recursion. From the query plan it looks like that PostgreSQL estimates row numbers based on some far-from-true assumptions, probably producing suboptimal plans - this effect can be reduced somewhat with the SP solution.

EDIT I may miss something, but just noticed that in the non-recursive term you don't filter the rows. Possibly you want to include only root nodes there (WHERE parent_id IS NULL) - I'd expect much less rows and recursions this way.

EDIT 2 AS it slowly became clear for me from the comments, I misthought the recursion in the original question going the other way. Here I mean starting from the root nodes and going deeper in the recursion.