Hierarchy Models – Adjacency List vs Nested Sets

firebirdhierarchytree

I've stumbled upon an article describing nested sets model for storing hierarchical data and was quite intrigued by the supposed performance boost.

after implementing this in our database, i've found that the adjecency list model performs far better. though the article I've read and a couple of SA posts (one, two) stated that any tree traversal is supposed to benefit, my tests demonstrate the opposite.

Am I missing something?

there are roughly 1.1 Million entries in the following table:

SQL> show table tax_nodes;
TAX_ID                          INTEGER Not Null
PARENT_TAX_ID                   INTEGER Nullable
RANK                            VARCHAR(50) Nullable
DIVISION_ID                     INTEGER Nullable
TREE_RIGHT                      INTEGER Nullable
TREE_LEFT                       INTEGER Nullable
CONSTRAINT INTEG_202:
  Primary key (TAX_ID)

Adjecency List Query to fetch parents

with recursive lineage as 
(select tax_id, parent_tax_id 
 from tax_nodes 
 where tax_id = 365612
 union ALL 
 select tax_id, parent_tax_id 
 from tax_nodes t 
 join lineage l on l.parent_tax_id = t.tax_id
) 
select l.tax_id
from lineage l

Nested Sets Query to fetch parents

select
  p2.tax_id from tax_nodes as p1,
  tax_nodes as p2
where
    p1.tree_left between p2.tree_left and p2.tree_right
    and p1.tax_id = 365612

Result

365612 336487 10260 10241 10240 35237 10239 1  //1 is the root

Why is the Adjecency List performing better though most resources suggest otherwise?

I'm running Firebird 2.5 superserver by the way.

UPDATE

On @MarkRotteveel's hint, I took a look at the execution plans and I suspect the NATURAL being the culprit here. Question now is, how do I get rid of it?

adjecency list execution plan

PLAN (LINEAGE TAX_NODES INDEX (RDB$PRIMARY108))
PLAN (LINEAGE T INDEX (RDB$PRIMARY108))

nested sets execution plan

PLAN JOIN (P1 INDEX (RDB$PRIMARY108), P2 NATURAL)

Best Answer

The execution plan of your adjacency list example shows that it is able to use the primary key index (RDB$PRIMARY108) for both parts of the UNION. This is very efficient, mainly because you start with a single value, and then recursively look up its descendents (which is probably just a small set of the entire set of records).

For the nested set execution plan it is not able to use an index, because it has to scan all records for the values of p2.tree_left and p2.tree_right. It can only use the primary key index for looking up p1.tax_id = 365612.

You might be able to improve execution if you add a descending index for tax_nodes.tree_left and an ascending index for tax_nodes.tree_right. You may need to swap ascending/descending because I usually do it wrong initially, even if I account for usually doing it wrong. However you may also need to change the query condition so the optimizer picks the indexes for tree_left and tree_right:

p2.tree_left <= p1.tree_left AND p2.tree_right >= p1.tree_left

Related Solutions

Mysql – Combining a tree structured Group model with its HABTM User model in one VIEW

This can't be done with a view.

Views in MySQL cannot reference variables. Try defining even a simple view that references a session variable, and there's no joy. Take a working query that references a variable and try creating a view using that query:

ERROR 1351 (HY000): View's SELECT contains a variable or parameter

Granted, there's a hack-around for this: you can call a stored function that you wrote, which returns the value of the variable you want to be accessible to the view. But, oops, we're already out of the scope of "view".

Additionally, a view, like a query, has no ability to iterate through rows and do something in "this" row because of what happened in "that" row.

Granted, again, there are hacks with variables in queries that can emulate some of that behavior but on the best days they can only look backwards and not forwards, and they don't always "see" the rows in the same sequence that the rows are returned.

I see three possible approaches to this, if SQL is where you really want to do it:

Option 1 involves a stored procedure and a temporary table. A stored procedure can iterate through rows with a cursor, and you'd need a temporary table since stored procedures don't have arrays or hashes. Iterate through your source data, either with a cursor or in a loop that starts by finding the top parent node and then each subsequent select based on what you've found so far... populating the temporary table as you go, taking precautions to avoid infinite loops caused by circular references in parent_id values (gotcha!), and then SELECT from your temporary table within the proc to return its contents as a result set to the client.

You might even end up with a second stored procedure that you call from the first one, which then recursively calls itself, to traverse your tree and build your nested set.

Option 2 involves insert, update, and delete triggers on the group and user tables, which would rebuild your MPTT structures every time any modification is done to group and user.

This would, in a sense, be the "most correct" way to do it, since your left and right ids would never be inconsistent with the underlying data... but as @dogmatic69 has pointed out, it's a very expensive operation... and not without its own snags, since you're limited to what you can to to table_x while you're inside a trigger on table_x.

On the other hand, with this option, when you needed to see the tree, the work to build it has already been done... so, much faster on SELECT.

Option 3 is to calculate the tree values with something other than SQL... so, really, I lied earlier, and there were only two SQL options I've come up with. Calculate it all in PHP, Perl, etc., and then populate the database with the values you calculated, which is how I generally do it. :) But I get away with it, because the parent/child relationships in my databases are not updated by any other process. If they were, I'd be staring down the barrel of Option 2.

Postgresql – Performing SELECT on EACH ROW in CTE or Nested QUERY

Let me know if this helps. This is using MSSQL (T-SQL) syntax so you might have to adjust it for postgres, but, using a temp table called #tree, containing the two columns id and parent and populated as your example is with

SELECT 1, 0
UNION 
SELECT 2, 1
UNION 
SELECT 3, 1
UNION
SELECT 4, 2
UNION 
SELECT 5, 2
UNION 
SELECT 6, 4
UNION
SELECT 7, 6
UNION 
SELECT 8, 6:

with cte as
(
    select id as currentnode, id as root
    from #tree
    union all
    select t.id as currentnode, cte.root 
    from #tree t join cte on t.parent = cte.currentnode
)
select *
from cte
order by root

The above recursive CTE will yield all nodes reachable starting from some root. You can do:

SELECT root, count(*)
from cte
group by root

The count(*) query should yield,

root   |  count
----------------
  1    |    8
  2    |    6
  3    |    1
  4    |    4
  5    |    1
  6    |    3
  7    |    1
  8    |    1

You could then subtract 1 from each of those to get the result you were hinting at. Is this what you were looking for? Even if not, I feel like a recursive CTE can probably help with what you need.

Best Answer

Related Solutions

Mysql – Combining a tree structured Group model with its HABTM User model in one VIEW

Postgresql – Performing SELECT on EACH ROW in CTE or Nested QUERY

Related Question