How to Store Trees on Disk

database-designphysical-designstoragetreexml

I am wondering how to store trees on a physical disk, for perhaps a tree-oriented database like an XML database.

Wikipedia states about XML databases:

…custom optimized data structures are used for storage and querying. This usually increases performance both in terms of read-only queries and updates. XML nodes and documents are the fundamental unit of (logical) storage, just as a relational database has fields and rows.

But doesn't explain what the actual data structure is for storing the tree on disk. Wondering if one could explain how that works. It seems you could store each "document" as its own file such as this as a file:

<a>
  <b>
    <c>
      <d>Hello</d>
    </c>
    <c>
      <x>World</x>
    </c>
    <e>
      <f ref="b:0/c:0/d"></f>
    </e>
    <g>
      <f ref="b:0/c:0/x"></f>
    </g>
  </b>
  <b>
    <c>
      <d>Hello2</d>
    </c>
    <c>
      <x>World2</x>
    </c>
    <e>
      <f ref="b:1/c:0/d"></f>
    </e>
    <g>
      <f ref="b:1/c:0/x"></f>
    </g>
  </b>
</a>

And then you would have a query "give me all b/c/x where b/c/d matches Hello", or just "give me all nodes below c". The ref="b:0/c:0/d" is a pointer to a specific node, something like that.

Basically I'm wondering, in order to accomplish these sorts of things, the data structure(s) to use to store XML or any tree structure on disk.

Maybe instead of storing the whole XML document in a file, you end up storing something like this:

a/b/c/d Hello
a/b/c/d World
a/b/e/f ref=b:0/c:0/d
a/b/g/f ref=b:0/c:1/d
...

That's pretty much what I'm wondering, how the data would look in a file on disk at a physical level. Not really sure what to search for. XML database data structures and such doesn't return much.

Best Answer

Every XML database will have its own solutions to these problems. You could do worse that explore the source code of an open-source XML database.

The main aspects of any solution will be (a) the ability to navigate from a node to its children or parent by following some kind of pointer structure rather than by parsing raw XML and searching sequentially, (b) the availability of indexes that enable you to find nodes quickly given properties of the node, e.g. indexing elements by name.