Avoid repeated traversing of hierarchical data when parsing a big tree frequently

database-designhierarchyoracleperformance

I have a hierarchical/tree data structure in Oracle/RDBMS where I need to traverse the tree frequently. The tree is huge and therefore parsing it every time will not be efficient. A better way could be to parse and store the data to give constant time turn out.

Consider the relationship below

Entity T1, T2, T3, T4

They have hierarchical structure i.e

T1 --one-to-many--> T2 --one-to-many--> T3 --one-to-many-->T4

All of them have many(optional)-to-many with XYZ.

I have to traverse/update this tree frequently where some nodes may even not have zero 'xyz'. This can be a traversal of 100 of thousands of nodes. How can I avoid such repeated expensive traversal? Will a materialized view help here?

enter image description here

We have to consider that T2, T3, T4 can be millions in number but the link T1<==>XYZ, T2<==>XYZ, T3<==>XYZ, T4<==>XYZ are Thousands in number.

Maybe the data structure below is one solution:

T1[id, name]
T2[id, name, T1.id]
T3[id, name, T2.id]
T4[id, name, T3.id]
XYZ[id, name]

LINK_T1_XYZ[T1.id, XYZ.id]
LINK_T1_T2_XYZ[T1.id, T2.id, XYZ.id] with a check T1 should parent of T2 
LINK_T1_T3_XYZ[T1.id, T3.id, XYZ.id]  "
LINK_T1_T4_XYZ[T1.id, T4.id, XYZ.id]  "

Consider the analogy of provider-receiver here now any of the entity in hierarchy[T1, T2, T3, T4] can provide any of it. Example:

t1_a1 --> xyz_a --> t2_a2
t2_a2 --> xyz_a --> t2_a3

A very simple design could be all permutation combination give a total of 4X4=16

Concept Link ==> Provider | Stuff | Receiver

T1_XYZ_T1_LINK [T1.id, XYZ.id, T1.id]
T1_XYZ_T2_LINK [T1.id, XYZ.id, T2.id]
T1_XYZ_T3_LINK [T1.id, XYZ.id, T3.id]
T1_XYZ_T4_LINK [T1.id, XYZ.id, T4.id]

T2_XYZ_T1_LINK [T2.id, XYZ.id, T1.id]
T2_XYZ_T2_LINK [T2.id, XYZ.id, T2.id]
T2_XYZ_T3_LINK [T2.id, XYZ.id, T3.id]
T2_XYZ_T4_LINK [T2.id, XYZ.id, T4.id]

T3_XYZ_T1_LINK [T3.id, XYZ.id, T1.id]
T3_XYZ_T2_LINK [T3.id, XYZ.id, T2.id]
T3_XYZ_T3_LINK [T3.id, XYZ.id, T3.id]
T3_XYZ_T4_LINK [T3.id, XYZ.id, T4.id]

T4_XYZ_T1_LINK [T4.id, XYZ.id, T1.id]
T4_XYZ_T2_LINK [T4.id, XYZ.id, T2.id]
T4_XYZ_T3_LINK [T4.id, XYZ.id, T3.id]
T4_XYZ_T4_LINK [T4.id, XYZ.id, T4.id]

That is so many join tables, there must be a better design to do this.

Business Context

Many of us want to know this but it is quite complex than it's abstract model, I would simplify and try to give an analogy.

Consider content distribution network like Netflix. Now Netflix (T1 entity e.g T1[N001, 'Netflix']) is the Organization with head office at New York City, USA. Under this, it has again regional offices (T2 entity, say east, west, north, south) and then under that sub-regional offices (T3 entity, state level, say New York, Virginia etc) and then branch offices – (T4 entity, say city level).

Consider the analogy at international level so that T2 can go up 100, 000.

XYZ – consider it content like TV series e.g: House of cards.

Every organization has contents which are either produced in-house or purchased from outside. Example: Netflix produces 'House of cards' so it can broadcast it in its own network and may sell it to some other network too.

Now the organization O1 can make a deal to sell it's stuff/show/movies (XYZ, say HOC) to another organization O2.

You may say a tuple T1|XYZ|T2 is enough to represent that i.e O1|HOC|O2 but we also need to store how we network/stream or distribute 'HOC' from O1 to O2 and there can be 16 ways (permutation of T1, T2, T3, T4) to do it,

Let's take an example:


o1
 |---r1.1
 |     |---c1.1.1
 |     |    |---h1.1.1.1
 |     |    |---h1.1.1.2
 |     |    |---h1.1.1.3
 |     |---c2
 |     |    |---h1.1.2.1
 |     |    |---h1.1.2.2
 |     |---C3
 |          |---h1.1.3.1
 |          |---h1.1.3.2
 |     
 |---r1.2
       |---C1.2.3
            |---h1.2.3.1
            |---h1.2.3.2

o2
 |---r2.1
 |     |---c2.1.1
 |     |    |---h2.1.1.1
 |     |    |---h2.1.1.2
 |     |    |---h2.1.1.3

 xyz = hoc 

Now some example ways in which o1 can provide hoc o2:

o1 --> hoc --> o2  
o1 --> hoc --> r2.1
o1 --> hoc --> c2.1.1.1
o1 --> hoc --> h2.1.1.1

r1.1 --> hoc --> o2  
r1.1 --> hoc --> r2.1
r1.1 --> hoc --> c2.1.1.1
r1.1 --> hoc --> h2.1.1.1

and so on.

So actually we need one structure to entertain two models:

  1. Organization model
  2. Distribution network

To be noted, Content is very limited here i.e XYZ may have records less than 50 mostly.

Best Answer

So initially it's

T2[
  id PK, 
  name, 
  T1_id NOT NULL, -- because it is hierarchy, T2 can not exist outside the parent (T1).
  FK1 = T1_id ref T1(id) 
]
...

and you also say

LINK_T1_T2_XYZ[T1.id, T2.id, XYZ.id] with a check T1 should parent of T2

You can migrate keys down the entities hierarchy

T2[
  id, 
  name, 
  T1_id NOT NULL,
  PK2 = (id, T1_id)
  FK2 = T1_id ref T1(id) 
]
T3[
  id, 
  name, 
  T1_id NOT NULL, 
  T2_id NOT NULL,
  PK3 = (id, T1_id, T2_id)
  FK3 = (T2_id, T1_id) ref T2(id, T1_id)      
]
...

The code for supporting T1, ..T4 entities and the entities which refer them will be a bit complicated but you'll get instant access to any of T1,..T4 from any referencing entity without trading off RI.