SQL Server – Implementing subtype of a subtype in type/subtype design pattern with mutually exclusive subclasses

database-design | sql-server | sql-server-2012 | subtypes

Introduction

In order for this question to be useful to future readers, I will use a generic data model to illustrate the problem I face.

Our data model consists of 3 entities, which shall be labeled A, B and C. In order to keep things simple, all their attributes will be of type int.

Entity A has the following attributes: D, E and X;

Entity B has the following attributes: D, E and Y;

Entity C has the following attributes: D and Z;

Since all entities share the common attribute D, I have decided to apply the type/subtype design pattern.

Important: Entities are mutually exclusive! This means that an entity is either A or B or C.
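
To make the setup concrete, here is a minimal sketch of one way such a type/subtype layout (with mutual exclusivity enforced through the discriminator column) could look in SQL Server; the table names, the EntityType discriminator, and the constraint names are illustrative only, not part of the original model:

    -- Base-type table: holds the shared attribute D plus a discriminator column.
    CREATE TABLE dbo.Entity
    (
        EntityID   INT     NOT NULL CONSTRAINT PK_Entity PRIMARY KEY,
        EntityType CHAR(1) NOT NULL
            CONSTRAINT CK_Entity_Type CHECK (EntityType IN ('A', 'B', 'C')),
        D          INT     NOT NULL,
        -- Exposed so sub-type tables can reference (ID, Type) and thereby
        -- guarantee that a given row belongs to exactly one sub-type.
        CONSTRAINT UQ_Entity_ID_Type UNIQUE (EntityID, EntityType)
    );

    -- One sub-type table per entity, holding only that entity's own attributes.
    CREATE TABLE dbo.EntityA
    (
        EntityID   INT     NOT NULL CONSTRAINT PK_EntityA PRIMARY KEY,
        EntityType CHAR(1) NOT NULL
            CONSTRAINT CK_EntityA_Type CHECK (EntityType = 'A'),
        E          INT     NOT NULL,
        X          INT     NOT NULL,
        CONSTRAINT FK_EntityA_Entity FOREIGN KEY (EntityID, EntityType)
            REFERENCES dbo.Entity (EntityID, EntityType)
    );

    -- dbo.EntityB (with E and Y) and dbo.EntityC (with Z) follow the same
    -- pattern, with EntityType constrained to 'B' and 'C' respectively.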

Problem:

Entities A and B have yet another common attribute, E, but this attribute is not present in entity C.

Question:

I would like to use the above-described characteristic to further optimize my design, if possible.

To be honest, I have no idea how to do this, nor where to start trying, hence this post.

Best Answer

Insofar as this Question is a continuation of Is my implementation of type/subtype design pattern (for mutually exclusive subclasses) correct?, which is itself a continuation of Don't know how to transform variable entity into relational table, I would ask: what exactly are you trying to optimize? Storage? The object model? Query complexity? Query performance? There are trade-offs when optimizing one aspect vs another since you can't optimize all aspects at the same time.

I completely agree with Remus's points regarding:

  • There are pros and cons to each approach (i.e. the ever-present "it depends" factor), and
  • The first priority is the efficiency of the data model (an inefficient data model cannot be corrected for by clean and/or efficient app code)

That said, the choice you face is between the following, arranged in order of least normalization to most normalization:

  • promoting property E to the base-type Table
  • keeping it in multiple sub-type Tables
  • fully normalizing E out to a new, intermediary sub-class table on the same level as C, of which A and B will directly be sub-classes (@MDCCL's answer)

Let's look at each option:

Move property E to base-type Table

PROs

  • Reduced query complexity for queries that need E but not X, Y, or Z.
  • Potentially more efficient for queries that need E but not X, Y, or Z (especially aggregate queries) due to no JOIN.
  • Potential to create index on (D, E) (and if so, potentially a Filtered Index on (D, E) where EntityType <> C, if such a condition is allowed; see the sketch below)

CONs

  • Cannot mark E as NOT NULL
  • Need extra CHECK CONSTRAINT on base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge issue)
  • Need to educate users of the data model as to why E must be NULL, and should even be ignored entirely, when EntityType = C.
  • Slightly less efficient when E is a fixed-length type, and a large portion of the rows are for EntityType of C (i.e. not using E hence it is NULL), and not using either the SPARSE option on the column or Data Compression on the Clustered Index
  • Potentially less efficient for queries that do not need E since the presence of E in the base-type table will increase the size of each row which in turn decreases the number of rows that can fit on a data page. But this is highly dependent on the exact datatype of E, the FILLFACTOR, how many rows are in the base-type table, etc.
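
As a rough illustration of this option, continuing the illustrative names from the sketch in the question (and with E removed from the sub-type tables), the CHECK constraint and Filtered Index could look roughly like this:

    -- E lives in the base-type table, nullable, and is forced to NULL for C rows.
    ALTER TABLE dbo.Entity
        ADD E INT NULL,
            CONSTRAINT CK_Entity_E_NullForC
                CHECK (EntityType <> 'C' OR E IS NULL);

    -- Filtered Index covering only the rows that can actually have E
    -- (<> against a constant is a supported filtered-index predicate).
    CREATE NONCLUSTERED INDEX IX_Entity_D_E
        ON dbo.Entity (D, E)
        WHERE EntityType <> 'C';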

Keep property E in each sub-type Table

PROs

  • Cleaner data model (i.e. don't have to worry about educating others as to why column E in the base-type table shouldn't be used because "it really isn't there")
  • Probably more closely resembles the object-model
  • Can mark column as NOT NULL if this is a required property of the entity
  • No need for extra CHECK CONSTRAINT on base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge gain)

CONs

  • Requires JOIN to sub-type Table(s) to get this property
  • Potentially slightly less efficient when needing E, due to the JOIN, depending on how many rows of A + B you have as opposed to how many rows of C there are.
  • Slightly more difficult / complex for operations that deal solely with entities A and B (and not C) as being the same "type". Of course, you could abstract this via a View that does a UNION ALL between a SELECT of the JOINed tables for A and another SELECT of the JOINed tables for B (see the sketch after this list). That will reduce the complexity of SELECT queries, but it is not as helpful for INSERT and UPDATE queries.
  • Depending on the specific queries and how frequently they are executed, this could be a potential inefficiency in cases where having an index on (D, E) would really help one or more frequently used queries, since D (in the base-type table) and E (in the sub-type tables) cannot be indexed together.
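
A rough sketch of such a View, again using the illustrative names from earlier (and assuming a dbo.EntityB table shaped like dbo.EntityA, with Y in place of X):

    -- Presents A and B rows, with their shared property E, as a single set.
    CREATE VIEW dbo.EntityAB
    AS
    SELECT  ent.EntityID, ent.EntityType, ent.D, a.E
    FROM    dbo.Entity AS ent
            INNER JOIN dbo.EntityA AS a
                ON a.EntityID = ent.EntityID
    UNION ALL
    SELECT  ent.EntityID, ent.EntityType, ent.D, b.E
    FROM    dbo.Entity AS ent
            INNER JOIN dbo.EntityB AS b
                ON b.EntityID = ent.EntityID;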

Normalize E out to intermediary Table between base-class and A & B

(Please note that I do like @MDCCL's answer as a viable alternative, depending on circumstances. The following is not meant as a strict criticism of that approach, but as a means of adding some perspective -- mine, of course -- by evaluating it in the same context as the two options I had already proposed. This will make it easier to clarify what I see as the relative difference between full normalization and the current approach of partial normalization.)

PROs

  • data model is fully normalized (there can't be anything inherently wrong with this, given that it is what RDBMSs are designed to do)
  • reduced query complexity for queries needing A and B, but not C (i.e. no need for two queries joined via UNION ALL)

CONs

  • slightly more space taken up (the Bar table duplicates the ID, and there is a new column, BarTypeCode) [negligible, but something to be aware of]
  • slight increase in query complexity as an additional JOIN is needed to get to either A or B
  • increased surface area for locking, mostly on INSERT (DELETE can be handled implicitly via marking Foreign Keys as ON DELETE CASCADE) since the Transaction will be held open slightly longer on the base-class table (i.e. Foo) [negligible, but something to be aware of]
  • no direct knowledge of actual type -- A or B -- within the base-class Table, Foo; it only knows of type Bar, which can be either A or B:

    Meaning, if you need to do queries over the general base info but need to either categorize by the entity type or filter out one or more entity types, then the base-class table doesn't have enough information, in which case you need to LEFT JOIN the Bar table (see the sketch at the end of this list). This will also reduce the effectiveness of indexing the FooTypeCode column.

  • no consistent approach to interacting with A & B vs C:

    Meaning, if each entity relates directly to the base-class table such that there is only ever that one JOIN to get the full entity, then everyone can more quickly and easily build up familiarity in terms of working with the data model. There will be a common approach to queries / Stored Procedures which makes them quicker to develop and less likely to have bugs. A consistent approach also makes it quicker and easier to add new sub-types in the future.

  • potentially less adaptable to business rules that change over time:

    Meaning, things always change, and it is fairly easy to move E up to the base-class Table if it becomes common to all sub-types. It is also easy enough to move a common property out to the sub-types if changes in the nature of the entities make that a worthwhile change. It is easy enough to either break a sub-type into two sub-types (just create another SubTypeID value) or to combine two or more sub-types into one. Conversely, what if E later on became a common property of all sub-types? Then the intermediary layer of the Bar table would be meaningless, and the added complexity would not be worth it. Of course, it is impossible to know if such a change would happen in 5 or even 10 years, so the Bar table is not necessarily, nor even highly likely to be, a bad idea (which is why I said "potentially less adaptable"). These are just points to consider; it's a gamble in either direction.

  • potentially inappropriate grouping:

    Meaning, just because the E property is shared between entity types A and B does not mean that A and B should be grouped together. Just because things "look" the same (i.e. same properties) does not mean that they are the same.
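
To illustrate the last few points, here is a rough sketch of the fully normalized layout using the Foo / Bar names referenced above; the column names and type codes are assumptions for illustration only:

    -- Base-class table: only knows whether a row is a "Bar" (A or B) or a C.
    CREATE TABLE dbo.Foo
    (
        FooID       INT     NOT NULL CONSTRAINT PK_Foo PRIMARY KEY,
        FooTypeCode CHAR(1) NOT NULL,  -- 'B' = Bar (i.e. A or B), 'C' = C
        D           INT     NOT NULL
    );

    -- Intermediary sub-class table: holds the shared property E and the
    -- BarTypeCode that finally distinguishes A from B.
    CREATE TABLE dbo.Bar
    (
        FooID       INT     NOT NULL CONSTRAINT PK_Bar PRIMARY KEY
            CONSTRAINT FK_Bar_Foo REFERENCES dbo.Foo (FooID),
        BarTypeCode CHAR(1) NOT NULL,  -- 'A' or 'B'
        E           INT     NOT NULL
    );
    -- The A and B tables (holding X and Y) would then be sub-classes of Bar.

    -- Categorizing every row by its concrete type requires a LEFT JOIN,
    -- since Foo alone cannot tell A from B:
    SELECT  f.FooID, f.D,
            COALESCE(b.BarTypeCode, f.FooTypeCode) AS ConcreteType  -- 'A', 'B', or 'C'
    FROM    dbo.Foo AS f
            LEFT JOIN dbo.Bar AS b
                ON b.FooID = f.FooID;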

Summary

Just like deciding if/when to denormalize, how to best approach this particular situation depends on considering the following aspects of the usage of the data model and making sure that the benefits outweigh the costs:

  • how many rows you will have for each EntityType (look at least 5 years down the road, assuming above average growth)
  • how many GB will each of these tables (base-type and sub-types) be in 5 years?
  • what specific datatype property E is
  • is it only one property or are there a few, or even several, properties
  • what queries you will need that require E and how often they will be executed
  • what queries you will need that do not need E and how often they will be executed

I think I tend to default to keeping E in the separate sub-type tables because it is, at the very least, "cleaner". I would consider moving E to the base-type table IF: most of the rows were not for EntityType of C; the number of rows was at least in the millions; and queries that needed E were executed more often than not, and/or the queries that would benefit from an index on (D, E) either executed very frequently and/or required enough system resources that having the index would reduce overall resource utilization, or at least prevent surges in resource consumption that exceed acceptable levels, or that last long enough to cause excessive blocking and/or an increase in deadlocks.


UPDATE

O.P. commented on this answer that:

My employers changed the business logic, removing E altogether!

This change is particularly important because it is exactly what I predicted might happen in the "CONs" subsection of the "Normalize E out to intermediary Table between base-class and A & B" section above (6th bullet point). The specific issue is how easy / difficult it is to refactor the data model when such changes happen (and they always do). Some will argue that any data model can be refactored / changed, so start with the ideal. But while it is true on a technical level that anything can be refactored, the reality of the situation is a matter of scale.

Resources are not infinite, not just CPU / Disk / RAM, but also development resources: time and money. Businesses are constantly setting priorities on projects because those resources are very limited. And quite often (at least in my experience), projects to gain efficiency (both system performance and faster development / fewer bugs) are prioritized below projects that increase functionality. While it is frustrating for us technical folks because we understand what the long-term benefits of refactoring projects are, it is just the nature of business that the less-technical business folks have an easier time seeing the direct relationship between new functionality and new revenue. What this boils down to is: "we will come back to fix that later" == "that problem will probably be there for the next 5 - 10 years because we will nearly always have more important things to work on (ironically, such as the support cases that keep coming up because we have not fixed it yet)".

With that in mind, if the size of the data is small enough such that changes can be made very quickly, and/or you have a maintenance window that is long enough to not only make the changes but to also roll back if something goes wrong, then normalizing E out to an intermediary Table between the base-class table and the A & B sub-class tables could work (though that still leaves you with no direct knowledge of the specific type (A or B) in the base-class table). BUT, if you have hundreds of millions of rows in these tables, and an incredible amount of code referencing the tables (code that has to be tested when changes are made), then it usually pays to be more pragmatic than idealistic. And this is the environment that I had to deal with for years: 987 million rows & 615 GB in the base-class table, spread across 18 servers. And so much code hit these tables (base-class and sub-class tables) that there was a lot of resistance -- mostly from management but sometimes from the rest of the team -- to making any changes due to the amount of development and QA resources that would need to be allocated.

So, once again, the "best" approach can only be determined situation-by-situation: you need to know your system (i.e. how much data and how the tables and code all relate), how to accomplish the refactoring, and the people that you work with (your team and possibly management -- can you get their buy-in for such a project?). There are some changes that I had been mentioning and planning for 1 - 2 years, and it took multiple sprints / releases to get maybe 85% of them implemented. But if you only have < 1 million rows and not a lot of code tied to these tables, then you are probably able to start out on the more ideal / "pure" side of things.

Just remember, whichever way you choose to go, pay attention to how that model works over the next 2 years at least (if possible). Pay attention to what worked and what caused pain, even if it seemed like the greatest idea at the time (which means you also need to allow yourself to accept screwing up -- we all do -- so that you can honestly evaluate pain-points). And pay attention to why certain decisions worked or didn't so that you can make decisions that are more likely to be "better" next time :-).