Animals hierarchy use-case: multiple farms, multiple species

database-designhierarchy

Let's talk about an online service where goat/sheep/cattle/llama farmers would register and keep track of their animals.

One of the features of such a service is tracking parent/child relationship. However, it's not as trivial as adding a parent_id field. Here are the reasons why:

  1. There are many farmers, each with their own account; they're selling animals to each other. This means that not every parent can be entered into the system with all its properties. So there is no parent_id primary key to link to.

  2. There are two IDs, national and internal. None of them, one of them or both of them can be entered. There should be a priority in determining the animal's parent. Internal IDs have priorities. Then comes national IDs. This means that if both are specified, the parent should be determined by an internal ID.

  3. Not all animal cards are made public by their owners. This means that we cannot rely national IDs and look for parents through the whole database and fetch parents' data to display the tree.

  4. Furthermore, the National ID is a NOT a unique field among the database (but is unique for each user and species). This is made so to allow farmer1 enter national ID "ABC" even if farmer2 entered national ID "ABC" for another animal as well (e.g. by mistake).

As you can see, this all makes impossible to rely on a single parent_id column. That's where come various approaches to solve the issue, and I ask your advice.


Approach #1

Introduced the following columns:

  • mom_national_id (nullable)
  • mom_internal_id (nullable)
  • dad_national_id (nullable)
  • dad_internal_id (nullable)

Unique indexes (user_id, species_id, mom/dad_national_internal_id)

Approach #2

  • mom_id BIGINT UNSIGNED (nullable)
  • is_mom_id_national BOOL
  • dad_id BIGINT UNSIGNED (nullable)
  • is_dad_id_national BIGINT UNSIGNED (nullable)

So, here are my questions

Q1. Is either of the approaches above correct, or can you recommend a better one?

Q2. If both of them correct, which one would be better in terms of querying speeds for building parent/child trees? Does it matter at all if all is needed is just up to ~50 generations up or down?

Q3. Which indexes are needed for best performance?


Despite the solution should not be specific to database type, the service is being built with using PostgreSQL.

Best Answer

The best approach would be neither of the two options you've proposed. Instead you should create your own proprietary surrogate key using either a monotonically increasing integer or a GUID if that is how you like to do these things. This key column would be NOT NULL on your animal table (obviously).

You need to keep all of the external IDs that are provided to you (internal/national) and use these to establish the mother_id and father_id of each new animal that you add to your database according to the complex rules that you've described.

This approach allows your system to take advantage of declarative referential integrity and ANSI SQL joining while preserving the mixed bag of external IDs that are provided to you for reference/reporting etc.