Animals hierarchy use-case: multiple farms, multiple species

database-design hierarchy

Let's talk about an online service where goat/sheep/cattle/llama farmers would register and keep track of their animals.

One of the features of such a service is tracking parent/child relationships. However, it's not as trivial as adding a parent_id field. Here are the reasons why:

There are many farmers, each with their own account, and they sell animals to each other. This means that not every parent can be entered into the system with all its properties, so there may be no parent row whose primary key a parent_id could reference.
There are two IDs, national and internal. Neither, either, or both of them can be entered. There should be a priority in determining the animal's parent: internal IDs come first, then national IDs. This means that if both are specified, the parent should be determined by the internal ID.
Not all animal cards are made public by their owners. This means that we cannot rely on national IDs to search the whole database for parents and fetch their data to display the tree.
Furthermore, the national ID is NOT unique across the database (but it is unique for each user and species). This allows farmer1 to enter national ID "ABC" even if farmer2 has entered national ID "ABC" for another animal as well (e.g. by mistake).

As you can see, all of this makes it impossible to rely on a single parent_id column. That is where the various approaches below come in, and I would like your advice.

Approach #1

Introduce the following columns:

mom_national_id (nullable)
mom_internal_id (nullable)
dad_national_id (nullable)
dad_internal_id (nullable)

Unique indexes (user_id, species_id, mom/dad_national_internal_id)

Approach #2

mom_id BIGINT UNSIGNED (nullable)
is_mom_id_national BOOL
dad_id BIGINT UNSIGNED (nullable)
is_dad_id_national BOOL

So, here are my questions

Q1. Is either of the approaches above correct, or can you recommend a better one?

Q2. If both of them are correct, which one would be better in terms of query speed for building parent/child trees? Does it matter at all if all that is needed is up to ~50 generations up or down?

Q3. Which indexes are needed for best performance?

Although the solution should not be specific to a database type, the service is being built with PostgreSQL.

Best Answer

The best approach would be neither of the two options you've proposed. Instead you should create your own proprietary surrogate key using either a monotonically increasing integer or a GUID if that is how you like to do these things. This key column would be NOT NULL on your animal table (obviously).

You need to keep all of the external IDs that are provided to you (internal/national) and use these to establish the mother_id and father_id of each new animal that you add to your database according to the complex rules that you've described.
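A minimal sketch of that idea, using Python's built-in sqlite3 as a stand-in for PostgreSQL. The table layout, the column names and the resolve_parent helper are all hypothetical. It assumes that internal IDs, like national IDs, are unique per user and species, that parents are only looked up among the same farmer's animals, and that a failed internal-ID lookup falls back to the national ID; none of these is confirmed by the question.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the question.
SCHEMA = """
CREATE TABLE animal (
    id          INTEGER PRIMARY KEY,            -- surrogate key owned by the service
    user_id     INTEGER NOT NULL,
    species_id  INTEGER NOT NULL,
    internal_id TEXT,                           -- farmer's own ID, may be missing
    national_id TEXT,                           -- national ID, may be missing
    mother_id   INTEGER REFERENCES animal (id), -- NULL when the parent is unknown
    father_id   INTEGER REFERENCES animal (id)
);
-- NULLs do not collide in these unique indexes, so missing IDs are allowed.
CREATE UNIQUE INDEX animal_internal_uq ON animal (user_id, species_id, internal_id);
CREATE UNIQUE INDEX animal_national_uq ON animal (user_id, species_id, national_id);
"""


def resolve_parent(conn, user_id, species_id, internal_id=None, national_id=None):
    """Return the surrogate id of the parent, or None if it is not in the system.

    The internal ID is tried first, then the national ID (assumed fallback).
    """
    for column, value in (("internal_id", internal_id), ("national_id", national_id)):
        if value is None:
            continue
        row = conn.execute(
            f"SELECT id FROM animal "
            f"WHERE user_id = ? AND species_id = ? AND {column} = ?",
            (user_id, species_id, value),
        ).fetchone()
        if row is not None:
            return row[0]
    return None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO animal (id, user_id, species_id, internal_id, national_id) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 1, "G-7", "ABC"),
        (2, 10, 1, "G-8", "XYZ"),
        (3, 20, 1, None, "ABC"),  # another farmer reusing national ID "ABC"
    ],
)

# Both IDs given and they disagree: the internal ID decides.
both_given = resolve_parent(conn, 10, 1, internal_id="G-8", national_id="ABC")
# Only the national ID given: resolved within farmer 10's own animals.
national_only = resolve_parent(conn, 10, 1, national_id="ABC")
# Parent never entered: mother_id / father_id would simply stay NULL.
unknown = resolve_parent(conn, 10, 1, national_id="NOPE")
```

The external IDs stay on the row for reference, while mother_id and father_id carry the actual relationship once it has been resolved.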

This approach allows your system to take advantage of declarative referential integrity and ANSI SQL joining while preserving the mixed bag of external IDs that are provided to you for reference/reporting etc.
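With mother_id and father_id as ordinary self-referencing foreign keys, walking the tree becomes a recursive query. The sketch below again uses sqlite3 as a stand-in; the table, the sample rows and the 50-generation cap are illustrative assumptions, and the same WITH RECURSIVE shape should carry over to PostgreSQL with minor changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animal (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    mother_id INTEGER REFERENCES animal (id),
    father_id INTEGER REFERENCES animal (id)
);
INSERT INTO animal VALUES (1, 'grandma', NULL, NULL);
INSERT INTO animal VALUES (2, 'grandpa', NULL, NULL);
INSERT INTO animal VALUES (3, 'mom', 1, 2);
INSERT INTO animal VALUES (4, 'dad', NULL, NULL);
INSERT INTO animal VALUES (5, 'kid', 3, 4);
""")

# Walk upwards from one animal, one generation per recursion step.
ANCESTORS = """
WITH RECURSIVE ancestors (id, name, generation) AS (
    SELECT p.id, p.name, 1
    FROM animal c
    JOIN animal p ON p.id IN (c.mother_id, c.father_id)
    WHERE c.id = :start
    UNION ALL
    SELECT p.id, p.name, a.generation + 1
    FROM ancestors a
    JOIN animal c ON c.id = a.id
    JOIN animal p ON p.id IN (c.mother_id, c.father_id)
    WHERE a.generation < :max_generations
)
SELECT id, name, generation FROM ancestors ORDER BY generation, id
"""

rows = conn.execute(ANCESTORS, {"start": 5, "max_generations": 50}).fetchall()
for row in rows:
    print(row)
```

Because of UNION ALL, an ancestor reachable through both the mother's and the father's line is listed once per path; deduplicating is left out of this sketch.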

Related Solutions

Hierarchy query with 3 tables

I've never tried anything like that, but what if you UNION your three tables into one, and then do a "one-table hierarchical query" on the result of that?

WITH alltabs as
    (Select 'T1' as src, Table1.ID, NULL AS ID_table1, NULL as ID_table2, Table1.name, Table1.size
    from Table1
    union
    Select 'T2' as src, Table2.ID, Table2.ID_table1, NULL as ID_table2, Table2.name, Table2.size
    from Table2
    union
    Select 'T3' as src, Table3.ID, Table3.ID_table1, Table3.ID_table2, Table3.name, Table3.size
    from Table3)
Select *
from alltabs
/*you can have an optional START WITH clause here, but it's not clear from your question what that would be*/
/*PRIOR marks the parent row: a T1 row is the parent of a T2 row, a T2 row is the parent of a T3 row*/
connect by (PRIOR id = id_table1 and PRIOR src = 'T1' and src = 'T2')
        or (PRIOR id = id_table2 and PRIOR src = 'T2' and src = 'T3')

(I'm assuming that the relationship is Table1 PARENT OF Table2 PARENT OF Table3).
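For engines without CONNECT BY, the same "union first, then walk" idea can be written as a recursive CTE. This is only a sketch under the same Table1 > Table2 > Table3 assumption, run through Python's sqlite3. The size column is left out and the sample rows are invented. All three rows deliberately share ID 1 to show why the src tag is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, ID_table1 INTEGER, name TEXT);
CREATE TABLE Table3 (ID INTEGER PRIMARY KEY, ID_table1 INTEGER, ID_table2 INTEGER, name TEXT);
INSERT INTO Table1 VALUES (1, 'root');
INSERT INTO Table2 VALUES (1, 1, 'middle');
INSERT INTO Table3 VALUES (1, 1, 1, 'leaf');
""")

QUERY = """
WITH RECURSIVE
alltabs (src, id, id_table1, id_table2, name) AS (
    SELECT 'T1', ID, NULL, NULL, name FROM Table1
    UNION ALL
    SELECT 'T2', ID, ID_table1, NULL, name FROM Table2
    UNION ALL
    SELECT 'T3', ID, ID_table1, ID_table2, name FROM Table3
),
tree (src, id, name, depth) AS (
    -- Roots: every Table1 row starts a hierarchy.
    SELECT src, id, name, 0 FROM alltabs WHERE src = 'T1'
    UNION ALL
    -- Children: T2 rows hang off T1 rows, T3 rows hang off T2 rows.
    SELECT c.src, c.id, c.name, t.depth + 1
    FROM tree t
    JOIN alltabs c
      ON (t.src = 'T1' AND c.src = 'T2' AND c.id_table1 = t.id)
      OR (t.src = 'T2' AND c.src = 'T3' AND c.id_table2 = t.id)
)
SELECT src, id, name, depth FROM tree ORDER BY depth
"""

tree_rows = conn.execute(QUERY).fetchall()
for row in tree_rows:
    print(row)
```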

Sql-server – How to have a one-to-many relationship with a privileged child

Another way (without NULLs and without cycles in the FOREIGN KEY relationships) is to have a third table to store the "favourite children". In most DBMSs, you'll need an additional UNIQUE constraint on TableB.

@Aaron was faster to identify that the naming convention above is rather cumbersome and can lead to errors. It's usually better (and will keep you sane) if you don't have Id columns all over your tables and if the columns that are joined have the same names in every table where they appear. So, here's a renaming:

Parent
    ParentID        INT NOT NULL PRIMARY KEY

Child
    ChildID         INT NOT NULL PRIMARY KEY
    ParentID        INT NOT NULL FOREIGN KEY REFERENCES Parent (ParentID)
    UNIQUE (ParentID, ChildID)

FavoriteChild
    ParentID        INT NOT NULL PRIMARY KEY
    ChildID         INT NOT NULL 
    FOREIGN KEY (ParentID, ChildID) 
        REFERENCES Child (ParentID, ChildID)
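To see what the three-table design enforces, here is a small sketch that runs it through Python's sqlite3 rather than SQL Server. The sample rows and the try_insert helper are made up for illustration. The primary key on FavoriteChild.ParentID allows one favourite per parent, and the composite foreign key ensures the favourite really is a child of that parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Parent (ParentID INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE Child (
    ChildID  INTEGER NOT NULL PRIMARY KEY,
    ParentID INTEGER NOT NULL REFERENCES Parent (ParentID),
    UNIQUE (ParentID, ChildID)   -- target of the composite foreign key below
);
CREATE TABLE FavoriteChild (
    ParentID INTEGER NOT NULL PRIMARY KEY,
    ChildID  INTEGER NOT NULL,
    FOREIGN KEY (ParentID, ChildID) REFERENCES Child (ParentID, ChildID)
);
INSERT INTO Parent VALUES (1), (2);
INSERT INTO Child VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO FavoriteChild VALUES (1, 10);
""")


def try_insert(sql):
    """Hypothetical helper: report whether the constraints accept a statement."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# Child 10 belongs to parent 1, so it cannot be parent 2's favourite.
wrong_parent = try_insert("INSERT INTO FavoriteChild VALUES (2, 10)")
# Parent 1 already has a favourite; the primary key blocks a second one.
second_favorite = try_insert("INSERT INTO FavoriteChild VALUES (1, 11)")
# Parent 2 picking its own child is fine.
valid = try_insert("INSERT INTO FavoriteChild VALUES (2, 20)")
```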

In SQL-Server (that you are using), you also have the option of the IsFavorite bit column you mention. The unique favourite child per parent can be accomplished via a filtered Unique Index:

Parent
    ParentID        INT NOT NULL PRIMARY KEY

Child
    ChildID         INT NOT NULL PRIMARY KEY
    ParentID        INT NOT NULL FOREIGN KEY REFERENCES Parent (ParentID)
    IsFavorite      BIT NOT NULL

CREATE UNIQUE INDEX is_FavoriteChild
  ON Child (ParentID)
  WHERE IsFavorite = 1 ;
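The filtered-index variant can be tried out the same way. SQLite also supports partial unique indexes, so the sketch below uses Python's sqlite3 as a stand-in for SQL Server; BIT is replaced by an INTEGER with a CHECK constraint, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Parent (ParentID INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE Child (
    ChildID    INTEGER NOT NULL PRIMARY KEY,
    ParentID   INTEGER NOT NULL REFERENCES Parent (ParentID),
    IsFavorite INTEGER NOT NULL CHECK (IsFavorite IN (0, 1))  -- stands in for BIT
);
-- Only rows with IsFavorite = 1 enter the index, so at most one per parent.
CREATE UNIQUE INDEX is_FavoriteChild ON Child (ParentID) WHERE IsFavorite = 1;
INSERT INTO Parent VALUES (1);
INSERT INTO Child VALUES (10, 1, 1), (11, 1, 0), (12, 1, 0);
""")

# Any number of non-favourites is fine; promoting a second favourite is not.
try:
    conn.execute("UPDATE Child SET IsFavorite = 1 WHERE ChildID = 11")
    second_favorite = "ok"
except sqlite3.IntegrityError:
    second_favorite = "rejected"

favorites = conn.execute(
    "SELECT ChildID FROM Child WHERE IsFavorite = 1"
).fetchall()
```

Note that, unlike the three-table design, nothing here forces a parent to have a favourite at all; the index only caps the count at one.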

And the main reason that your option 1 is not recommended, at least not in SQL-Server, is that the pattern of circular paths in the foreign key references has some problems.

Read a quite old article: SQL By Design: The Circular Reference

When inserting or deleting rows from the two tables, you'll run into the "chicken-and-egg" problem: which table should you insert into first without violating any constraint?

In order to solve that, you have to define at least one column nullable. (OK, technically you don't have to, you can have all columns as NOT NULL but only in DBMS, like Postgres and Oracle, that have implemented deferrable constraints. See @Erwin's answer in a similar question: Complex foreign key constraint in SQLAlchemy on how this can be done in Postgres). Still, this setup feels like skating on thin ice.
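As a rough illustration of the deferrable route, the sketch below builds a circular pair of NOT NULL foreign keys and fills it inside one transaction. It uses Python's sqlite3, which also implements DEFERRABLE INITIALLY DEFERRED foreign keys, as a stand-in for Postgres or Oracle. The table layout is an assumed version of the circular design, not copied from the question.

```python
import sqlite3

# isolation_level=None: the module opens no implicit transactions,
# so BEGIN / COMMIT / ROLLBACK below are issued by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
CREATE TABLE Parent (
    ParentID        INTEGER NOT NULL PRIMARY KEY,
    FavoriteChildID INTEGER NOT NULL
        REFERENCES Child (ChildID) DEFERRABLE INITIALLY DEFERRED
)""")
conn.execute("""
CREATE TABLE Child (
    ChildID  INTEGER NOT NULL PRIMARY KEY,
    ParentID INTEGER NOT NULL
        REFERENCES Parent (ParentID) DEFERRABLE INITIALLY DEFERRED
)""")

# Both rows go in before the foreign keys are checked at COMMIT.
conn.execute("BEGIN")
conn.execute("INSERT INTO Parent VALUES (1, 10)")  # child 10 does not exist yet
conn.execute("INSERT INTO Child VALUES (10, 1)")
conn.execute("COMMIT")

# A parent whose favourite child never shows up is refused at COMMIT time.
conn.execute("BEGIN")
conn.execute("INSERT INTO Parent VALUES (2, 99)")
try:
    conn.execute("COMMIT")
    dangling = "committed"
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")
    dangling = "rejected"

parents = conn.execute("SELECT ParentID, FavoriteChildID FROM Parent").fetchall()
```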

Check also an almost identical question at SO (but for MySQL), "In SQL, is it OK for two tables to refer to each other?", where my answer is pretty much the same. MySQL has no partial indexes though, so the only viable options are the nullable FK and the extra table solution.