SQL Server – Implementing subtype of a subtype in type/subtype design pattern with mutually exclusive subclasses

database-design | sql-server | sql-server-2012 | subtypes

Introduction

In order for this question to be useful to future readers, I will use a generic data model to illustrate the problem I face.

Our data model consists of 3 entities, which shall be labeled A, B and C. In order to keep things simple, all their attributes will be of type int.

Entity A has the following attributes: D, E and X;

Entity B has the following attributes: D, E and Y;

Entity C has the following attributes: D and Z;

Since all entities share the common attribute D, I have decided to apply the type/subtype design pattern.

Important: Entities are mutually exclusive! This means that an entity is either A or B or C.
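
To make the setup concrete, here is a minimal sketch of one way such a type/subtype layout (with mutual exclusivity enforced through the discriminator column) could look in SQL Server; the table names, the EntityType discriminator, and the constraint names are illustrative only, not part of the original model:

    -- Base-type table: holds the shared attribute D plus a discriminator column.
    CREATE TABLE dbo.Entity
    (
        EntityID   INT     NOT NULL CONSTRAINT PK_Entity PRIMARY KEY,
        EntityType CHAR(1) NOT NULL
            CONSTRAINT CK_Entity_Type CHECK (EntityType IN ('A', 'B', 'C')),
        D          INT     NOT NULL,
        -- Exposed so sub-type tables can reference (ID, Type) and thereby
        -- guarantee that a given row belongs to exactly one sub-type.
        CONSTRAINT UQ_Entity_ID_Type UNIQUE (EntityID, EntityType)
    );

    -- One sub-type table per entity, holding only that entity's own attributes.
    CREATE TABLE dbo.EntityA
    (
        EntityID   INT     NOT NULL CONSTRAINT PK_EntityA PRIMARY KEY,
        EntityType CHAR(1) NOT NULL
            CONSTRAINT CK_EntityA_Type CHECK (EntityType = 'A'),
        E          INT     NOT NULL,
        X          INT     NOT NULL,
        CONSTRAINT FK_EntityA_Entity FOREIGN KEY (EntityID, EntityType)
            REFERENCES dbo.Entity (EntityID, EntityType)
    );

    -- dbo.EntityB (with E and Y) and dbo.EntityC (with Z) follow the same
    -- pattern, with EntityType constrained to 'B' and 'C' respectively.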

Problem:

Entities A and B have yet another common attribute, E, but this attribute is not present in entity C.

Question:

I would like to use the above-described characteristic to further optimize my design, if possible.

To be honest, I have no idea how to do this, nor where to start trying, hence this post.

Best Answer

Insofar as this Question is a continuation of Is my implementation of type/subtype design pattern (for mutually exclusive subclasses) correct?, which is itself a continuation of Don't know how to transform variable entity into relational table, I would ask: what exactly are you trying to optimize? Storage? The object model? Query complexity? Query performance? There are trade-offs when optimizing one aspect vs another since you can't optimize all aspects at the same time.

I completely agree with Remus's points regarding:

  • There are pros and cons to each approach (i.e. the ever-present "it depends" factor), and
  • The first priority is the efficiency of the data model (an inefficient data model cannot be corrected for by clean and/or efficient app code)

That said, the choice you face is between the following, arranged in order of least normalization to most normalization:

  • promoting property E to the base-type Table
  • keeping it in multiple sub-type Tables
  • fully normalizing E out to a new, intermediary sub-class table on the same level as C, of which A and B will directly be sub-classes (@MDCCL's answer)

Let's look at each option:

Move property E to base-type Table

PROs

  • Reduced query complexity for queries that need E but not X, Y, or Z.
  • Potentially more efficient for queries that need E but not X, Y, or Z (especially aggregate queries) due to no JOIN.
  • Potential to create index on (D, E) (and if so, potentially a Filtered Index on (D, E) where EntityType <> C, if such a condition is allowed; see the sketch below)

CONs

  • Cannot mark E as NOT NULL
  • Need extra CHECK CONSTRAINT on base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge issue)
  • Need to educate users of the data model as to why E must be NULL, and should even be ignored entirely, when EntityType = C.
  • Slightly less efficient when E is a fixed-length type, and a large portion of the rows are for EntityType of C (i.e. not using E hence it is NULL), and not using either the SPARSE option on the column or Data Compression on the Clustered Index
  • Potentially less efficient for queries that do not need E since the presence of E in the base-type table will increase the size of each row which in turn decreases the number of rows that can fit on a data page. But this is highly dependent on the exact datatype of E, the FILLFACTOR, how many rows are in the base-type table, etc.
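
As a rough illustration of this option, continuing the illustrative names from the sketch in the question (and with E removed from the sub-type tables), the CHECK constraint and Filtered Index could look roughly like this:

    -- E lives in the base-type table, nullable, and is forced to NULL for C rows.
    ALTER TABLE dbo.Entity
        ADD E INT NULL,
            CONSTRAINT CK_Entity_E_NullForC
                CHECK (EntityType <> 'C' OR E IS NULL);

    -- Filtered Index covering only the rows that can actually have E
    -- (<> against a constant is a supported filtered-index predicate).
    CREATE NONCLUSTERED INDEX IX_Entity_D_E
        ON dbo.Entity (D, E)
        WHERE EntityType <> 'C';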

Keep property E in each sub-type Table

PROs

  • Cleaner data model (i.e. don't have to worry about educating others as to why column E in the base-type table shouldn't be used because "it really isn't there")
  • Probably more closely resembles the object-model
  • Can mark column as NOT NULL if this is a required property of the entity
  • No need for extra CHECK CONSTRAINT on base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge gain)

CONs

  • Requires JOIN to sub-type Table(s) to get this property
  • Potentially slightly less efficient when needing E, due to the JOIN, depending on how many rows of A + B you have as opposed to how many rows of C there are.
  • Slightly more difficult / complex for operations that deal solely with entities A and B (and not C) as being the same "type". Of course, you could abstract this via a View that does a UNION ALL between a SELECT of the JOINed tables for A and another SELECT of the JOINed tables for B (see the sketch after this list). That will reduce the complexity of SELECT queries, but it is not as helpful for INSERT and UPDATE queries.
  • Depending on the specific queries and how frequently they are executed, this could be a potential inefficiency in cases where having an index on (D, E) would really help one or more frequently used queries, since D (in the base-type table) and E (in the sub-type tables) cannot be indexed together.
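
A rough sketch of such a View, again using the illustrative names from earlier (and assuming a dbo.EntityB table shaped like dbo.EntityA, with Y in place of X):

    -- Presents A and B rows, with their shared property E, as a single set.
    CREATE VIEW dbo.EntityAB
    AS
    SELECT  ent.EntityID, ent.EntityType, ent.D, a.E
    FROM    dbo.Entity AS ent
            INNER JOIN dbo.EntityA AS a
                ON a.EntityID = ent.EntityID
    UNION ALL
    SELECT  ent.EntityID, ent.EntityType, ent.D, b.E
    FROM    dbo.Entity AS ent
            INNER JOIN dbo.EntityB AS b
                ON b.EntityID = ent.EntityID;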

Normalize E out to intermediary Table between base-class and A & B

(Please note that I do like @MDCCL's answer as a viable alternative, depending on circumstances. The following is not meant as a strict criticism of that approach, but as a means of adding some perspective -- mine, of course -- by evaluating it in the same context as the two options I had already proposed. This will make it easier to clarify what I see as the relative difference between full normalization and the current approach of partial normalization.)

PROs

  • data model is fully normalized (there can't be anything inherently wrong with this, given that it is what RDBMSs are designed to do)
  • reduced query complexity for queries needing A and B, but not C (i.e. no need for two queries joined via UNION ALL)

CONs

  • slightly more space taken up (the Bar table duplicates the ID, and there is a new column, BarTypeCode) [negligible, but something to be aware of]
  • slight increase in query complexity as an additional JOIN is needed to get to either A or B
  • increased surface area for locking, mostly on INSERT (DELETE can be handled implicitly via marking Foreign Keys as ON DELETE CASCADE) since the Transaction will be held open slightly longer on the base-class table (i.e. Foo) [negligible, but something to be aware of]
  • no direct knowledge of actual type -- A or B -- within the base-class Table, Foo; it only knows of type Bar, which can be either A or B:

    Meaning, if you need to do queries over the general base info but need to either categorize by the entity type or filter out one or more entity types, then the base-class table doesn't have enough information, in which case you need to LEFT JOIN the Bar table (see the sketch at the end of this list). This will also reduce the effectiveness of indexing the FooTypeCode column.

  • no consistent approach to interacting with A & B vs C:

    Meaning, if each entity relates directly to the base-class table such that there is only ever that one JOIN to get the full entity, then everyone can more quickly and easily build up familiarity in terms of working with the data model. There will be a common approach to queries / Stored Procedures which makes them quicker to develop and less likely to have bugs. A consistent approach also makes it quicker and easier to add new sub-types in the future.

  • potentially less adaptable to business rules that change over time:

    Meaning, things always change, and it is fairly easy to move E up to the base-class Table if it becomes common to all sub-types. It is also easy enough to move a common property out to the sub-types if changes in the nature of the entities make that a worthwhile change. It is easy enough to either break a sub-type into two sub-types (just create another SubTypeID value) or to combine two or more sub-types into one. Conversely, what if E later on became a common property of all sub-types? Then the intermediary layer of the Bar table would be meaningless, and the added complexity would not be worth it. Of course, it is impossible to know if such a change would happen in 5 or even 10 years, so the Bar table is not necessarily, nor even highly likely to be, a bad idea (which is why I said "potentially less adaptable"). These are just points to consider; it's a gamble in either direction.

  • potentially inappropriate grouping:

    Meaning, just because the E property is shared between entity types A and B does not mean that A and B should be grouped together. Just because things "look" the same (i.e. same properties) does not mean that they are the same.
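
To illustrate the last few points, here is a rough sketch of the fully normalized layout using the Foo / Bar names referenced above; the column names and type codes are assumptions for illustration only:

    -- Base-class table: only knows whether a row is a "Bar" (A or B) or a C.
    CREATE TABLE dbo.Foo
    (
        FooID       INT     NOT NULL CONSTRAINT PK_Foo PRIMARY KEY,
        FooTypeCode CHAR(1) NOT NULL,  -- 'B' = Bar (i.e. A or B), 'C' = C
        D           INT     NOT NULL
    );

    -- Intermediary sub-class table: holds the shared property E and the
    -- BarTypeCode that finally distinguishes A from B.
    CREATE TABLE dbo.Bar
    (
        FooID       INT     NOT NULL CONSTRAINT PK_Bar PRIMARY KEY
            CONSTRAINT FK_Bar_Foo REFERENCES dbo.Foo (FooID),
        BarTypeCode CHAR(1) NOT NULL,  -- 'A' or 'B'
        E           INT     NOT NULL
    );
    -- The A and B tables (holding X and Y) would then be sub-classes of Bar.

    -- Categorizing every row by its concrete type requires a LEFT JOIN,
    -- since Foo alone cannot tell A from B:
    SELECT  f.FooID, f.D,
            COALESCE(b.BarTypeCode, f.FooTypeCode) AS ConcreteType  -- 'A', 'B', or 'C'
    FROM    dbo.Foo AS f
            LEFT JOIN dbo.Bar AS b
                ON b.FooID = f.FooID;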

Summary

Just like deciding if/when to denormalize, how to best approach this particular situation depends on considering the following aspects of the usage of the data model and making sure that the benefits outweigh the costs:

  • how many rows you will have for each EntityType (look at least 5 years down the road, assuming above average growth)
  • how many GB will each of these tables (base-type and sub-types) be in 5 years?
  • what specific datatype property E is
  • is it only one property or are there a few, or even several, properties
  • what queries you will need that require E and how often they will be executed
  • what queries you will need that do not need E and how often they will be executed

I think I tend to default to keeping E in the separate sub-type tables because it is, at the very least, "cleaner". I would consider moving E to the base-type table IF: most of the rows were not for EntityType of C; the number of rows was at least in the millions; and queries that needed E were executed more often than not, and/or the queries that would benefit from an index on (D, E) either executed very frequently and/or required enough system resources that having the index would reduce overall resource utilization, or at least prevent surges in resource consumption that exceed acceptable levels, or that last long enough to cause excessive blocking and/or an increase in deadlocks.


UPDATE

O.P. commented on this answer that:

My employers changed the business logic, removing E altogether!

This change is particularly important because it is exactly what I predicted might happen in the "CONs" subsection of the "Normalize E out to intermediary Table between base-class and A & B" section above (6th bullet point). The specific issue is how easy / difficult it is to refactor the data model when such changes happen (and they always do). Some will argue that any data model can be refactored / changed, so start with the ideal. But while it is true on a technical level that anything can be refactored, the reality of the situation is a matter of scale.

Resources are not infinite, not just CPU / Disk / RAM, but also development resources: time and money. Businesses are constantly setting priorities on projects because those resources are very limited. And quite often (at least in my experience), projects to gain efficiency (both system performance and faster development / fewer bugs) are prioritized below projects that increase functionality. While it is frustrating for us technical folks because we understand what the long-term benefits of refactoring projects are, it is just the nature of business that the less-technical business folks have an easier time seeing the direct relationship between new functionality and new revenue. What this boils down to is: "we will come back to fix that later" == "that problem will probably be there for the next 5 - 10 years because we will nearly always have more important things to work on (ironically, such as the support cases that keep coming up because we have not fixed it yet)".

With that in mind, if the size of the data is small enough such that changes can be made very quickly, and/or you have a maintenance window that is long enough to not only make the changes but to also roll back if something goes wrong, then normalizing E out to an intermediary Table between the base-class table and the A & B sub-class tables could work (though that still leaves you with no direct knowledge of the specific type (A or B) in the base-class table). BUT, if you have hundreds of millions of rows in these tables, and an incredible amount of code referencing the tables (code that has to be tested when changes are made), then it usually pays to be more pragmatic than idealistic. And this is the environment that I had to deal with for years: 987 million rows & 615 GB in the base-class table, spread across 18 servers. And so much code hit these tables (base-class and sub-class tables) that there was a lot of resistance -- mostly from management but sometimes from the rest of the team -- to making any changes due to the amount of development and QA resources that would need to be allocated.

So, once again, the "best" approach can only be determined situation-by-situation: you need to know your system (i.e. how much data and how the tables and code all relate), how to accomplish the refactoring, and the people that you work with (your team and possibly management -- can you get their buy-in for such a project?). There are some changes that I had been mentioning and planning for 1 - 2 years, and it took multiple sprints / releases to get maybe 85% of them implemented. But if you only have < 1 million rows and not a lot of code tied to these tables, then you are probably able to start out on the more ideal / "pure" side of things.

Just remember, whichever way you choose to go, pay attention to how that model works over the next 2 years at least (if possible). Pay attention to what worked and what caused pain, even if it seemed like the greatest idea at the time (which means you also need to allow yourself to accept screwing up -- we all do -- so that you can honestly evaluate pain-points). And pay attention to why certain decisions worked or didn't so that you can make decisions that are more likely to be "better" next time :-).