MySQL – Parent/Child Relationship Based on Field Name Implication

MySQLrelational-theory

I am working with the harmonized system and try to catalog this in a MySQL database. This works for example like this:

01  LIVE ANIMALS
0101  horses, asses, mules and hinnies, live
0102  bovine animals, live
0103  swine, live
010310  Purebred breeding animals
010391  Weighing less than 50 (11023 lb) each
010392  Weighing 50 (11023 lb) or more each
0104  sheep and goats, live
0105  chickens, ducks, geese, turkeys, and guineas, live
0106  animals, live, nesoi Nesoi - not elsewhere specified of indicated.

As you can see, there is a clear parent-child relationship. This relationship can always be traced back by the numbering that is used. From this thread I learned that I best use a text data field for the codes (01, 0101, etc). My question is: should I somewhere log the parent-child relationship, or would this not be necessary? After all, I can do a:

SELECT * FROM accounts WHERE code LIKE '0103%'

And get all the subaccounts just fine. And I feel that a parent-child table could make things more complicated. But I'd like to hear what other people feel make sense.

Best Answer

What you describe is a variant of "Materialized Path", see for example https://communities.bmc.com/docs/DOC-9902 for an example. I say variant because you have no separator token, but are instead using a fixed number of positions for each different level.

In Materialized Path you usually don't have an explicit parent/child relationship represented, instead it is encoded into the "path" string. To determine sub-categories, use like as in your example. For ancestor-categories you need some procedural part that splits the path-string. Direct parent is the first ancestor. Direct children can be located by using a combination of like and length() function. Root and leaf predicates are easy to define as well.

As a starting point, why would you like to add a parent-child relationship (if it is not for the operations described above)? If the answer is referential integrity it will be a bit backward since you already have the domain at hand (your existing table). Adding a separate relation (table) to represent the parent/child relationship would be something like:

create table ... 
( category varchar(...) not null
,     constraint ... 
          foreign key ... references ...
, parent_category varchar(...) not null 
,     constraint ... 
          foreign key ... references ...
)

You could have a trigger populate this table after insert of a new "category"
But it does not really protect you from anything, it would act more like a report of your encoded parent child relation.

The other option is to add a nullable attribute to your existing table, your "top-categories" would have null as a parent, and the others would have substr(code, 1, length(code)-2) as a parent. This would be some kind of mix between "adjacency list" and "materialized path". You would still have to protect against anomalies like ('01020304', NULL) and ('010203','0202'). It is possible to add triggers that validates this.

If you want to keep your current table as is it is probably easiest to add validation code, either in triggers or in your application that makes sure that there exists a "super-category".

If you are willing to split the code in parts you could have a look at "Nested Sets". It is described in the link I gave above, or as an additional transitive closure relation, see for example http://karwin.blogspot.se/2009/04/sql-antipatterns-strike-back-slides.html. A slight variation of the same method is described at: http://dustbite.se/tree/ . Both "Nested Set" and "Transitive Closure" will add a significant overhead to what is probably the most common operation that you do

Personally I would stick with what you have and add validation triggers that checks that the codes are consistent during modifications of the "tree"

As Oracle XE is mentioned as an alternative DBMS in @VĂ©race post, I just point out that other commersial vendors is offering free versions as well. See Microsoft SQL Server 2014 Express (http://msdn.microsoft.com/en-us/evalcenter/dn434042.aspx) and DB2 Express-C ( http://www-01.ibm.com/software/data/db2/express-c/download.html). DB2 does not have a space restriction like Oracle, I'm not familiar with the restrictions of SQL-server.