How to best store a directory tree in a database

database-design tree

I want to represent my directory structure in some format (currently I'm just using JSON.)

This is how a sample JSON might look. For those curious, it was generated with the Unix tree command: tree /path/to/folder -J --noreport -h.

{
    ...
    "type":"directory",
    "name":"dev",
    "size":4096,
    "contents":[
        {"type":"directory","name":"protocols","size":4096, "contents":[]},
        {"type":"file","name":"architecture.txt","size":4716},
        {"type":"file","name":"exceptions.py","size":31263},
        {"type":"file","name":"models.js","size":101882},
        {"type":"file","name":"proxy.cpp","size":29097},
        {"type":"file","name":"keylogfile.xyz","size":7889},
        {"type":"file","name":"Readme.txt","size":8857},
    ]
    ...
}

So this is just a representation of the entire folder structure of some path as JSON.

I can have many such separate JSON files, each representing a directory tree. There's no correlation or link between these files.

Running the tree command on a standard Windows "C:\" partition gives me a JSON file of ~30 MB. So I think we can assume that the largest file a user would upload will be ~100 MB.

Once the files are stored, these are the operations I plan to make on a file:

1. Get the entire file.
2. Given a path, get its immediate children (akin to doing ls on that path.)
3. Given a path, get the complete subtree of the path.
4. Modify metadata of some item, say change its name or add a new note to it.

Operations 2 & 3 are the ones that I expect to happen the most.
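For reference, here is a minimal sketch of what operations 2 and 3 look like on the parsed JSON held in memory. It assumes the root of the file is a single directory object shaped like the sample above (the elided "..." parts are left out), and that paths are slash-separated and start with the root's name. The function names find_node, list_children and subtree are hypothetical.

```python
import json

# Hypothetical sample in the same shape as the JSON above.
SAMPLE = json.loads("""
{"type": "directory", "name": "dev", "size": 4096, "contents": [
    {"type": "directory", "name": "protocols", "size": 4096, "contents": []},
    {"type": "file", "name": "architecture.txt", "size": 4716},
    {"type": "file", "name": "Readme.txt", "size": 8857}
]}
""")


def find_node(root, path):
    """Return the node at a slash-separated path such as 'dev/protocols', or None."""
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != root["name"]:
        return None
    node = root
    for part in parts[1:]:
        # Linear scan over the children: this is the per-level cost of the "no DB" option.
        children = node.get("contents", [])
        node = next((c for c in children if c["name"] == part), None)
        if node is None:
            return None
    return node


def list_children(root, path):
    """Operation 2: names of the immediate children, like ls."""
    node = find_node(root, path)
    if node is None:
        raise KeyError(path)
    return [c["name"] for c in node.get("contents", [])]


def subtree(root, path):
    """Operation 3: the complete subtree rooted at the path."""
    node = find_node(root, path)
    if node is None:
        raise KeyError(path)
    return node
```

The lookups themselves are cheap once the file is parsed; the cost I'm worried about is parsing a file of tens of megabytes on every request.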

Here are the ways of storing this data that I've come up with:

NO DB:
- Store the file as-is on disk (/home/forest/<uuid>.json)
- Operation no 1 becomes fast & easy – just send the entire file
- but the others can get slow because they all involve parsing the entire JSON first and then iterating on it.
NoSQL
- I've never used NoSQL databases before (only read a few posts about their use cases etc.)
- I think op 1 (entire file read) would be fast
- but no idea if there will be any improvement on the other ops as compared to just using files.
RDBMS
- I've used relational DBs before but don't think my data has anything to do with tables
- I did google around though and found that PostgreSQL has an ltree type to store hierarchical data, but I'm not sure if that is what I need.
  - If it is, HOW will I get the data in?
Graph Databases?
- Again, no prior experience with these, just shooting in the dark
- At the end of the day, a directory is just a tree
- Instead of creating a vanilla JSON, maybe I could generate a format that one of the Graph DBs can read-in
- Once I have some graph DB object, maybe all the operations become fast enough.
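
To make the RDBMS option a bit more concrete, here is a minimal sketch of how the data might get in. It uses Python's built-in sqlite3 with one row per node and a stored path column, not PostgreSQL's ltree; that is a swapped-in technique, chosen only so the example runs without a server. The table name nodes, its columns, the tree_id value and the helper names are all hypothetical, and it assumes names never contain "/".

```python
import sqlite3


def iter_rows(node, tree_id, parent_path=""):
    """Flatten one parsed tree into (tree_id, path, parent_path, name, type, size) rows."""
    path = f"{parent_path}/{node['name']}"
    yield (tree_id, path, parent_path, node["name"], node["type"], node.get("size"))
    for child in node.get("contents", []):
        yield from iter_rows(child, tree_id, path)


def load_tree(conn, tree_id, root):
    """Create the (hypothetical) nodes table if needed and insert one uploaded tree."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS nodes (
               tree_id     TEXT NOT NULL,
               path        TEXT NOT NULL,
               parent_path TEXT NOT NULL,
               name        TEXT NOT NULL,
               type        TEXT NOT NULL,
               size        INTEGER,
               PRIMARY KEY (tree_id, path)
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS nodes_parent ON nodes (tree_id, parent_path)"
    )
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)", iter_rows(root, tree_id)
    )
    conn.commit()


def children(conn, tree_id, path):
    """Operation 2: immediate children, answered from the parent_path index."""
    rows = conn.execute(
        "SELECT name FROM nodes WHERE tree_id = ? AND parent_path = ? ORDER BY name",
        (tree_id, path),
    )
    return [r[0] for r in rows]


def subtree_paths(conn, tree_id, path):
    """Operation 3: the node plus all descendants, as a range scan on the path.

    Every descendant path starts with path + "/", and "0" is the character
    right after "/" in byte order, so [path + "/", path + "0") covers exactly
    the descendants under SQLite's default binary collation.
    """
    rows = conn.execute(
        "SELECT path FROM nodes WHERE tree_id = ? "
        "AND (path = ? OR (path >= ? AND path < ?)) ORDER BY path",
        (tree_id, path, path + "/", path + "0"),
    )
    return [r[0] for r in rows]
```

With this layout operations 2 and 3 become indexed lookups, while a rename (operation 4) would have to rewrite the paths of every descendant, so it trades cheap reads for more expensive moves.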

My question is this: For my use case, what is the best way to store the data?

Replies to comments.

Why do you think you need a database for this?

Depending upon what you are trying to do you might not need to store the data in a db at all.

To be really honest, I don't know if I need a database. I just know I want this data stored in a format that allows me to perform the operations defined above reasonably fast.

What are you trying to achieve by putting this into a db? Reporting? Analysis? Will it be used by an application?

I'm doing this for a web application. Once the data is stored in a way I'm satisfied with, I plan to create a Web API (probably JSON based) that performs the operations I've listed above. The data will be sent to the client where it will get displayed on the frontend in some way.

Do you want 1 row per file? What other meta data do you want to store? Size? date? File owner?

Yeah, I want the metadata that's usually associated with files.

A friend asked this, so I'm clearing it up here: I don't just have 1 JSON file (representing a tree). I could have n number of such trees (they are basically uploaded by the user and I expect them to be <100 MB in size.)

Best Answer

I'm a DB noob as well, but I do recall PostgreSQL having a json datatype for storing JSON structures. Maybe you can review the PostgreSQL docs and decide if it works for you.

Related Solutions

How to store data with a query that’s approximated

You may want to try the retrieval (re trie val) tree, also known as the TRIE. Some refer to this as a Radix Tree.

The idea is to create tree node structures that contain branches for every character your data field can possibly contain.

Let's use a simple case, a numeric field. The character range is 0-9, so each tree node would contain ten (10) branches. Let's take the worst case for a 4-byte unsigned integer, 2^32 - 1, which is 4294967295. What is its length? Just compute the length by taking the integer part of the log base 10 of 4294967295 and adding 1.

mysql> select floor(log10(power(2,32)) + 1);
+-------------------------------+
| floor(log10(power(2,32)) + 1) |
+-------------------------------+
|                            10 |
+-------------------------------+
1 row in set (0.00 sec)

So, you would have a TRIE with a maximum height of 10. Starting at the root of the TRIE, if you have the number 4294967295, you traverse branches 4,2,9,4,9,6,7,2,9, and 5. At each branch, you would perform an array-styled binary search.

If the branch located at that TRIE node is an exact match, you assign a percentage to that level and recursively walk down that branch to check the next digit. The percentages returned from the deeper TRIE nodes are added to the percentage for the node currently being searched.

If the branch located at that TRIE node is NOT an exact match, you stop your recursive search there and return either 0 or some other percentage you may want to designate.

Summing the values returned from all searched TRIE nodes then gives the fraction of the string that matched. In other words,

Pct per Node = (1 / (Number of TRIE nodes that need to be searched)) or Zero(0).

Sum(Pct) = (Number of TRIE nodes exactly matched) / (Number of TRIE nodes that need to be searched [length of the string being searched]).
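
As an illustration of the scoring just described, here is a minimal in-memory sketch in Python. It is not tied to any database, and it uses a dict per node instead of a fixed array of branches with a binary search; that substitution is mine, made to keep the example short. The names TrieNode, Trie, insert and match_score are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # one branch per character actually stored
        self.terminal = False   # True if a stored value ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, value):
        """Add a string (for example the decimal digits of a number) to the trie."""
        node = self.root
        for ch in value:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def match_score(self, query):
        """Fraction of the query matched from the root before the first mismatch.

        Each matched node contributes 1 / len(query); the walk stops at the
        first character that has no branch, so the result is
        (nodes exactly matched) / (length of the string being searched).
        """
        if not query:
            return 0.0
        node = self.root
        matched = 0
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                break
            matched += 1
        return matched / len(query)
```

For example, after inserting "4294967295", the query "4294000000" matches the first four digits and then stops, which gives a score of 4/10.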

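Given the length of the numeric field you store, the height of the TRIE is O(log n) in the largest value n, because that is the number of digits. At each TRIE node, a binary search for the proper branch costs O(log b) for b branches, which is a constant for a fixed character set. Overall, your search time should be proportional to the field length, that is O(log n) in the value.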

This performance stands out even more if the field is alphanumeric. Assuming single-byte characters, each TRIE node would have up to 256 branches (128 for plain 7-bit ASCII). The height of the TRIE would depend on the length of the character field. Representing this TRIE for variable-length strings would produce TRIE nodes that would be very sparse, but quickly searchable nonetheless.

Regardless of what database you use, carefully plan the data types you will be using to represent the TRIE node. You may also want to partition the table so that strings of length n terminate in partition n. Thus, you will have O(log n) search time at each partition.

http://en.wikipedia.org/wiki/Trie

http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node24.html

http://www.webreference.com/js/tips/000318.html

http://en.wikipedia.org/wiki/Radix_tree

PostgreSQL tree structure and recursive CTE optimization

If you really have to modify these data rarely, then you can simply store the result of the CTE in a table, and run queries against this table. You can define indexes based on your typical queries.
Then TRUNCATE and repopulate (and ANALYZE) as necessary.
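
A minimal sketch of that idea, using Python's built-in sqlite3 only so that it runs without a server. The tables tree and tree_flat, their columns and the function refresh_tree_flat are hypothetical names. SQLite has no TRUNCATE, so the sketch uses DELETE; on PostgreSQL you would TRUNCATE as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table: a plain adjacency list.
    CREATE TABLE tree (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES tree(id),
        name      TEXT NOT NULL
    );
    INSERT INTO tree VALUES (1, NULL, 'root'), (2, 1, 'a'), (3, 1, 'b'), (4, 2, 'c');

    -- Hypothetical table that holds the stored result of the recursive CTE.
    CREATE TABLE tree_flat (
        id      INTEGER PRIMARY KEY,
        root_id INTEGER NOT NULL,
        depth   INTEGER NOT NULL,
        path    TEXT NOT NULL
    );
    -- Index chosen for an assumed typical query: "everything under a given root".
    CREATE INDEX tree_flat_root ON tree_flat (root_id, depth);
""")


def refresh_tree_flat(conn):
    """Empty the stored result, repopulate it from the CTE, then refresh statistics."""
    with conn:  # one transaction: readers never see a half-filled table
        conn.execute("DELETE FROM tree_flat")
        conn.execute("""
            WITH RECURSIVE walk(id, root_id, depth, path) AS (
                SELECT id, id, 0, name FROM tree WHERE parent_id IS NULL
                UNION ALL
                SELECT t.id, w.root_id, w.depth + 1, w.path || '/' || t.name
                FROM tree AS t JOIN walk AS w ON t.parent_id = w.id
            )
            INSERT INTO tree_flat (id, root_id, depth, path)
            SELECT id, root_id, depth, path FROM walk
        """)
    conn.execute("ANALYZE")
```

Queries then read tree_flat directly and never run the recursion; refresh_tree_flat is called whenever the underlying data changes.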

On the other hand, if you can put the CTE in separate stored procedures rather than a view, you can easily put your conditions in the CTE part rather than in the final SELECT (which is basically what you do when querying against tree_view_1), so that far fewer rows will be involved in the recursion. From the query plan it looks like PostgreSQL estimates row counts based on some far-from-true assumptions, probably producing suboptimal plans. This effect can be reduced somewhat with the SP solution.

EDIT I may be missing something, but I just noticed that in the non-recursive term you don't filter the rows. Possibly you want to include only root nodes there (WHERE parent_id IS NULL). I'd expect far fewer rows and recursions this way.

EDIT 2 As it slowly became clear to me from the comments, I had assumed the recursion in the original question went the other way. Here I mean starting from the root nodes and going deeper in the recursion.
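
To illustrate putting the condition in the non-recursive term, here is a minimal sketch using Python's built-in sqlite3 (chosen only so it runs without a PostgreSQL server). The table tree with columns id, parent_id and name, and the function descendants, are hypothetical; only parent_id comes from the discussion above.

```python
import sqlite3


def descendants(conn, start_id=None):
    """Walk from the root nodes (or from one start node) down the tree.

    The filter sits in the non-recursive (anchor) term of the CTE, so the
    recursion only ever expands rows below the chosen starting point, rather
    than building every tree and filtering in the final SELECT.
    """
    if start_id is None:
        anchor, params = "parent_id IS NULL", ()   # start at all root nodes
    else:
        anchor, params = "id = ?", (start_id,)     # start at one given node
    sql = f"""
        WITH RECURSIVE walk(id, name, depth) AS (
            SELECT id, name, 0 FROM tree WHERE {anchor}
            UNION ALL
            SELECT t.id, t.name, w.depth + 1
            FROM tree AS t JOIN walk AS w ON t.parent_id = w.id
        )
        SELECT id, name, depth FROM walk ORDER BY depth, id
    """
    return conn.execute(sql, params).fetchall()
```

In a PostgreSQL stored procedure the start node would be passed as a parameter in the same position, which is what keeps the number of rows in the recursion small.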
