I suspect that most if not all of your `PropertyString`s are only valid for one specific `PropertyType` and not for all of them. In that case you would still have a transitive dependency between `PropertyString` and `PropertyType`, which can be further normalized.
To resolve this, make both `PropertyTypeId` and `PropertyId` together the (composite) primary key of table `PropertyString`. (This would mean `PropertyId=1, PropertyTypeId=1` is a different value than `PropertyId=1, PropertyTypeId=2`.)
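As a sketch of that composite key (the table and column definitions here are hypothetical, since the actual schema isn't shown):

```sql
-- Hypothetical sketch: PropertyString keyed by (PropertyId, PropertyTypeId),
-- so PropertyId=1 / PropertyTypeId=1 is a different row than
-- PropertyId=1 / PropertyTypeId=2.
CREATE TABLE PropertyString
(
    PropertyId     INT          NOT NULL,
    PropertyTypeId INT          NOT NULL,
    Value          VARCHAR(100) NOT NULL,  -- assumed datatype
    CONSTRAINT PK_PropertyString PRIMARY KEY (PropertyId, PropertyTypeId),
    CONSTRAINT FK_PropertyString_PropertyType
        FOREIGN KEY (PropertyTypeId) REFERENCES PropertyType (PropertyTypeId)
);
```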
But the question is whether further normalization is not just possible but actually beneficial. Denormalizing your database schema as jrhutch suggests might be the better direction to take. But that depends on your exact requirements. If all your `PropertyString`s are unique (or at least more often unique than duplicated), that additional table to store them would just be unnecessary complication without adding any real value.
Insofar as this Question is a continuation of Is my implementation of type/subtype design pattern (for mutually exclusive subclasses) correct?, which is itself a continuation of Don't know how to transform variable entity into relational table, I would ask: what exactly are you trying to optimize? Storage? The object model? Query complexity? Query performance? There are trade-offs when optimizing one aspect vs another since you can't optimize all aspects at the same time.
I completely agree with Remus's points regarding:
- There are pros and cons to each approach (i.e. the ever-present "it depends" factor), and
- The first priority is the efficiency of the data model (an inefficient data model cannot be corrected for by clean and/or efficient app code)
That said, the choice you face is between the following, arranged in order of least normalization to most normalization:
- promoting property `E` to the base-type Table
- keeping it in multiple sub-type Tables
- fully normalizing `E` out to a new, intermediary sub-class table on the same level as `C`, that `A` and `B` will directly be sub-classes of (@MDCCL's answer)
Let's look at each option:
Move property `E` to the base-type Table
PROs
- Reduced query complexity for queries that need `E` but not `X`, `Y`, or `Z`.
- Potentially more efficient for queries that need `E` but not `X`, `Y`, or `Z` (especially aggregate queries), due to no JOIN.
- Potential to create an index on `(D, E)` (and if so, potentially a Filtered Index on `(D, E)` where `EntityType <> C`, if such a condition is allowed)
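For example, the filtered index mentioned above could look like the following in SQL Server (the table name `BaseEntity` and the discriminator value `'C'` are assumptions):

```sql
-- Hypothetical sketch (SQL Server): filtered index on (D, E) that skips
-- rows of EntityType 'C', which never populate E.
CREATE NONCLUSTERED INDEX IX_BaseEntity_D_E
    ON dbo.BaseEntity (D, E)
    WHERE EntityType <> 'C';
```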
CONs
- Cannot mark `E` as `NOT NULL`
- Need an extra `CHECK CONSTRAINT` on the base-type table to ensure that `E IS NULL` when `EntityType = C` (though this is not a huge issue)
- Need to educate users of the data model as to why `E` must be `NULL`, and should even be ignored entirely, when `EntityType = C`.
- Slightly less efficient when `E` is a fixed-length type, a large portion of the rows are for `EntityType` of `C` (i.e. not using `E`, hence it is `NULL`), and you are not using either the `SPARSE` option on the column or Data Compression on the Clustered Index
- Potentially less efficient for queries that do not need `E`, since the presence of `E` in the base-type table will increase the size of each row, which in turn decreases the number of rows that can fit on a data page. But this is highly dependent on the exact datatype of `E`, the `FILLFACTOR`, how many rows are in the base-type table, etc.
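The extra `CHECK CONSTRAINT` mentioned above might look like this (assuming a hypothetical base-type table named `BaseEntity` and a discriminator value of `'C'`):

```sql
-- Hypothetical sketch: E must be NULL whenever EntityType = 'C'.
ALTER TABLE dbo.BaseEntity
    ADD CONSTRAINT CK_BaseEntity_E_NullForTypeC
    CHECK (EntityType <> 'C' OR E IS NULL);
```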
Keep property `E` in each sub-type Table
PROs
- Cleaner data model (i.e. don't have to worry about educating others as to why column `E` in the base-type table shouldn't be used because "it really isn't there")
- Probably more closely resembles the object model
- Can mark the column as `NOT NULL` if this is a required property of the entity
- No need for an extra `CHECK CONSTRAINT` on the base-type table to ensure that `E IS NULL` when `EntityType = C` (though this is not a huge gain)
CONs
- Requires a JOIN to the sub-type Table(s) to get this property
- Potentially slightly less efficient when needing `E`, due to the JOIN, depending on how many rows of `A` + `B` you have as opposed to how many rows of `C` there are.
- Slightly more difficult / complex for operations that deal solely with entities `A` and `B` (and not `C`) as being the same "type". Of course, you could abstract this via a View that does a `UNION ALL` between a `SELECT` of the JOINed tables for `A` and another `SELECT` of the JOINed tables for `B`. That will reduce the complexity of `SELECT` queries but is not so helpful for `INSERT` and `UPDATE` queries.
- Depending on the specific queries and how frequently they are executed, this could be a potential inefficiency in cases where having an index on `(D, E)` would really help one or more frequently used queries, since `D` and `E` cannot be indexed together.
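The `UNION ALL` view mentioned above could be sketched as follows (all table and column names besides `A`, `B`, and `E` are assumptions):

```sql
-- Hypothetical sketch: present A and B uniformly as one "type",
-- each branch JOINing the base-type table to one sub-type table.
CREATE VIEW dbo.EntityAB AS
    SELECT base.EntityId, base.D, sub.E
    FROM   dbo.BaseEntity base
    INNER JOIN dbo.A sub ON sub.EntityId = base.EntityId
    UNION ALL
    SELECT base.EntityId, base.D, sub.E
    FROM   dbo.BaseEntity base
    INNER JOIN dbo.B sub ON sub.EntityId = base.EntityId;
```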
Normalize `E` out to an intermediary Table between the base-class and `A` & `B`
(Please note that I do like @MDCCL's answer as a viable alternative, depending on circumstances. The following is not meant as a strict criticism of that approach, but as a means of adding some perspective -- mine, of course -- by evaluating it in the same context as the two options I had already proposed. This will make it easier to clarify what I see as the relative difference between full normalization and the current approach of partial normalization.)
PROs
- the data model is fully normalized (there can't be anything inherently wrong with this, given it is what RDBMSs are designed to do)
- reduced query complexity for queries needing `A` and `B`, but not `C` (i.e. no need for two queries joined via `UNION ALL`)
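That reduced complexity means a single query through the intermediary table covers both `A` and `B` (column names such as `FooId` are assumptions):

```sql
-- Hypothetical sketch: one JOIN to Bar returns rows for both A and B,
-- with no UNION ALL required.
SELECT f.FooId, bar.E, bar.BarTypeCode
FROM   dbo.Foo f
INNER JOIN dbo.Bar bar ON bar.FooId = f.FooId;
```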
CONs
- slightly more space taken up (the `Bar` table duplicates the ID, and there is a new column, `BarTypeCode`) [negligible, but something to be aware of]
- slight increase in query complexity as an additional `JOIN` is needed to get to either `A` or `B`
- increased surface area for locking, mostly on `INSERT` (`DELETE` can be handled implicitly via marking Foreign Keys as `ON DELETE CASCADE`) since the Transaction will be held open slightly longer on the base-class table (i.e. `Foo`) [negligible, but something to be aware of]
- no direct knowledge of actual type -- `A` or `B` -- within the base-class Table, `Foo`; it only knows of type `Bar`, which can be either `A` or `B`:
Meaning, if you need to do queries over the general base info but need to either categorize by the entity type or filter out one or more entity types, then the base-class table doesn't have enough information, in which case you need to `LEFT JOIN` the `Bar` table. This will also reduce the effectiveness of indexing the `FooTypeCode` column.
- no consistent approach to interacting with `A` & `B` vs `C`:
Meaning, if each entity relates directly to the base-class table such that there is only ever that one JOIN to get the full entity, then everyone can more quickly and easily build up familiarity in terms of working with the data model. There will be a common approach to queries / Stored Procedures which makes them quicker to develop and less likely to have bugs. A consistent approach also makes it quicker and easier to add new sub-types in the future.
- potentially less adaptable to business rules that change over time:
Meaning, things always change, and it is fairly easy to move `E` up to the base-class Table if it becomes common to all sub-types. It is also easy enough to move a common property out to the sub-types if changes in the nature of the entities make that a worthwhile change. It is easy enough to either break a sub-type into two sub-types (just create another `SubTypeID` value) or to combine two or more sub-types into one. Conversely, what if `E` later on became a common property of all sub-types? Then the intermediary layer of the `Bar` table would be meaningless, and the added complexity would not be worth it. Of course, it is impossible to know if such a change would happen in 5 or even 10 years, so the `Bar` table is not necessarily, nor even highly likely to be, a bad idea (which is why I said "potentially less adaptable"). These are just points to consider; it's a gamble in either direction.
- potentially inappropriate grouping:
Meaning, just because the `E` property is shared between entity types `A` and `B` does not mean that `A` and `B` should be grouped together. Just because things "look" the same (i.e. same properties) does not mean that they are the same.
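To illustrate the "no direct knowledge of actual type" point above: filtering the base-class rows by `A` vs `B` forces a join out to `Bar` (column names such as `FooId` are assumptions):

```sql
-- Hypothetical sketch: Foo only knows the row is of type "Bar"; determining
-- whether it is really A or B requires joining to the Bar table.
SELECT f.FooId, f.FooTypeCode, bar.BarTypeCode
FROM   dbo.Foo f
LEFT JOIN dbo.Bar bar ON bar.FooId = f.FooId
WHERE  bar.BarTypeCode = 'A';  -- assumed discriminator value
```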
Summary
Just like deciding if/when to denormalize, how to best approach this particular situation depends on considering the following aspects of the usage of the data model and making sure that the benefits outweigh the costs:
- how many rows you will have for each `EntityType` (look at least 5 years down the road, assuming above-average growth)
- how many GB each of these tables (base-type and sub-types) will be in 5 years
- what specific datatype property `E` is
- whether it is only one property or there are a few, or even several, properties
- what queries you will need that require `E` and how often they will be executed
- what queries you will need that do not need `E` and how often they will be executed
I think I tend to default to keeping `E` in the separate sub-type tables because it is, at the very least, "cleaner". I would consider moving `E` to the base-type table IF: most of the rows were not for `EntityType` of `C`; and the number of rows was at least in the millions; and I more often than not executed queries that needed `E`, and/or the queries that would benefit from an index on `(D, E)` either execute very frequently and/or require enough system resources such that having the index reduces overall resource utilization, or at least prevents surges in resource consumption that go above acceptable levels or last long enough to cause excessive blocking and/or increases in deadlocks.
UPDATE
O.P. commented on this answer that:
My employers changed the business logic, removing E altogether!
This change is particularly important because it is exactly what I predicted might happen in the "CONs" subsection of the "Normalize `E` out to an intermediary Table between the base-class and `A` & `B`" section above (6th bullet point). The specific issue is how easy / difficult it is to refactor the data model when such changes happen (and they always do). Some will argue that any data model can be refactored / changed, so start with the ideal. But while it is true on a technical level that anything can be refactored, the reality of the situation is a matter of scale.
Resources are not infinite, not just CPU / Disk / RAM, but also development resources: time and money. Businesses are constantly setting priorities on projects because those resources are very limited. And quite often (at least in my experience), projects to gain efficiency (even both system performance as well as faster development / fewer bugs) are prioritized below projects that increase functionality. While it is frustrating for us technical folks because we understand what the long-term benefits of refactoring projects are, it is just the nature of business that the less-technical, business folks have an easier time seeing the direct relationship between new functionality and new revenue. What this boils down to is: "we will come back to fix that later" == "that problem will probably be there for the next 5 - 10 years because we will nearly always have more important things to work on (ironically, such as the support cases that keep coming up because we have not fixed it yet)".
With that in mind, if the size of the data is small enough that changes can be made very quickly, and/or you have a maintenance window that is long enough to not only make the changes but also roll back if something goes wrong, then normalizing `E` out to an intermediary Table between the base-class table and the `A` & `B` sub-class tables could work (though that still leaves you with no direct knowledge of the specific type (`A` or `B`) in the base-class table). BUT, if you have hundreds of millions of rows in these tables, and an incredible amount of code referencing the tables (code that has to be tested when changes are made), then it usually pays to be more pragmatic than idealistic. And this is the environment that I had to deal with for years: 987 million rows & 615 GB in the base-class table, spread across 18 servers. And so much code hit these tables (base-class and sub-class tables) that there was a lot of resistance -- mostly from management but sometimes from the rest of the team -- to making any changes, due to the amount of development and QA resources that would need to be allocated.
So, once again, the "best" approach can only be determined situation-by-situation: you need to know your system (i.e. how much data and how the tables and code all relate), how to accomplish the refactoring, and the people that you work with (your team and possibly management -- can you get their buy-in for such a project?). There are some changes that I had been mentioning and planning for 1 - 2 years, and took multiple sprints / releases to get maybe 85% of them implemented. But if you only have < 1 million rows and not a lot of code tied to these tables, then you are probably able to start out on the more ideal / "pure" side of things.
Just remember, whichever way you choose to go, pay attention to how that model works over the next 2 years at least (if possible). Pay attention to what worked and what caused pain, even if it seemed like the greatest idea at the time (which means you also need to allow yourself to accept screwing up -- we all do -- so that you can honestly evaluate pain-points). And pay attention to why certain decisions worked or didn't so that you can make decisions that are more likely to be "better" next time :-).
Best Answer
If I'm understanding your description, `PersonID_ID` is a duplicate of `Citizen_IdentityID` or `Foreigner_WorkPermitID` (whichever is not NULL). If this is true, I'd suggest something similar to what DCook has suggested:
- remove `Citizen_IdentityID` and `Foreigner_WorkPermitID` (eliminate duplicate data)
- add a `Person_Type` (char(1) not NULL) column; current values could be 'C' or 'F' for Citizen/Foreigner, respectively
Depending on where your data is coming from, this design would allow for the inclusion of future person types (eg, in the US you have citizens, visa holders, green card holders; in some countries there are clear differences between citizen vs resident vs visitor vs non-visitor visa - not to mention different types of visas). Obviously the need for additional person types will depend on what you're trying to model.
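A sketch of that suggestion (datatypes and constraint details are assumptions):

```sql
-- Hypothetical sketch: a single ID plus a one-character discriminator.
CREATE TABLE Person
(
    PersonID_ID INT     NOT NULL PRIMARY KEY,
    Person_Type CHAR(1) NOT NULL
        CHECK (Person_Type IN ('C', 'F'))  -- 'C' = Citizen, 'F' = Foreigner
);
```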
Looking at this a bit more ...
Can an individual's type/status change over time (eg, foreigner becomes citizen; citizen relinquishes citizenship and becomes a foreigner)?
The reason I ask is that you'd obviously need to change the value of the `Person_Type` and `PersonID_ID` columns ... probably not a big deal ... but based on your model it appears you'll be using `PersonID_ID` as part of the PK in at least one other table, so that table's PK would also need to be updated ... while certainly doable, this could quickly get messy depending on the number of tables and records that would need their PK modified, plus the hassles of working around RI constraints. The mess gets a little nastier if you also need to maintain some sort of history/audit record for a person (eg, having to store/map the changing of PKs ... yuck!).
So now I'm wondering if `PersonID_ID` should be a separate, truly unique value, independent of the `Person_Type`? And if so, then you'd probably need to retain the `Citizen_IdentityID`/`Foreigner_WorkPermitID` value, but perhaps give it a more generic name.
So, that would change my suggestion to something like:
- add a `Person_Type` (char(1) not NULL) column; current values could be 'C' or 'F' for Citizen/Foreigner, respectively
- make `PersonID_ID` its own unique value (could be done through something like an identity column or a unique-PK generator); this value would never change, even if `Person_Type` were to change over time
- replace the `Citizen_IdentityID`/`Foreigner_WorkPermitID` columns with a single, generic ID that serves as a FK to the `Citizen`/`Foreigner` tables (as appropriate); perhaps something like `Person_TypeID` (int not NULL)?
OK, one more concern/question ...
Can a person ever be associated with more than one `Person_Type` (C/F)?
For example, if a foreigner goes through the naturalization process to become a citizen, how would this be represented in your model?
- Would the person now have both a `Foreigner_WorkPermitID` and a `Citizen_IdentityID`, thus requiring the storage of both values in the `Person` table?
- Or would you update the `Person` table to show a change from 'Foreigner' to 'Citizen' (ie, change `Person_Type` and `Person_TypeID`)?
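For reference, the revised suggestion above might be sketched like this (datatypes, the IDENTITY choice, and constraint details are assumptions):

```sql
-- Hypothetical sketch of the revised design: PersonID_ID is a surrogate key
-- that never changes, while Person_TypeID points at the Citizen or
-- Foreigner table depending on Person_Type.
CREATE TABLE Person
(
    PersonID_ID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Person_Type   CHAR(1) NOT NULL
        CHECK (Person_Type IN ('C', 'F')),
    Person_TypeID INT NOT NULL  -- FK to Citizen or Foreigner, per Person_Type
);
```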