Here is what I think is intended in this model:
- Category is a kind of thing. In a data model it would be an entity type, in a database it would be a table.
- Attribute is a facet of a thing. In a data model it would be a predicate type, in a database it would be a column.
- Object is an instance of a thing. In a data model it would be an entity, in a database it would be a row (in a particular table).
- Value is an instance of a predicate in a data model, or a column value in a database table.
- Domain Lookup is a mechanism that allows you to constrain attribute values. This implementation works by constraining the value to one of a set of legal values. In a database this could be likened to a check constraint or a lookup table (see the sketch after this list). Obviously there are other types of constraints that could be imposed (but aren't in this metadata model), such as range constraints.
- RIConfig is a relational constraint between two kinds of things. This is a relationship type in a data model or a foreign key constraint in a database. At this level it is describing the relational rule, not the instances of relationships.
- RI is an instance of a relationship between two specific entities. In a physical database, this would be a record in a foreign key index (perhaps?).
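To make the Domain Lookup and RIConfig mappings concrete, here is a minimal sketch of how they would typically surface in a physical database. All table and column names here are illustrative, not part of the metadata model:

CREATE TABLE StatusDomain
(
    StatusCode VARCHAR(10) NOT NULL PRIMARY KEY
);

CREATE TABLE Widget
(
    WidgetID   INT NOT NULL PRIMARY KEY,
    -- Domain Lookup as a lookup table: StatusCode must be one of
    -- the legal values stored in StatusDomain
    StatusCode VARCHAR(10) NOT NULL
        FOREIGN KEY REFERENCES StatusDomain (StatusCode),
    -- The same idea expressed as a check constraint instead
    Priority   CHAR(1) NOT NULL
        CHECK (Priority IN ('H', 'M', 'L'))
);

-- RIConfig corresponds to the foreign key rule between two kinds of
-- things; each WidgetPart row satisfying it is an RI instance
CREATE TABLE WidgetPart
(
    PartID   INT NOT NULL PRIMARY KEY,
    WidgetID INT NOT NULL
        FOREIGN KEY REFERENCES Widget (WidgetID)
);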
You are quite right to observe that the modeling of relationships is flawed insofar as it doesn't account for the implementation of these relationships. RIConfig should join to Attribute not to Category - or rather, there should be an intersection table between RIConfig and Attribute to allow for compound primary/foreign key relationships.
As I noted above, domain constraints are also not handled with realistic flexibility.
Given the limitations of this model, it would not be adequate as a metadata repository for many real-world databases.
Normalization is a set of techniques to avoid certain problems, viz. data update anomalies. In a perfect world you would not store structured content, but would split it into its constituent parts across however many tables are required and re-construct each complex object at runtime (1st normal form).
However, if pre-parsing the object and storing the interesting / frequently used bits somewhere else resolves a performance issue, then do this - as long as you do it with your eyes open. If you, and everyone else who may code against the database, understand that extra work must be done in the application in order to achieve the necessary response times, then that's a design choice, and necessary in the real world.
Response to comment:
My reading of your question is that content.text contains structured text in which links are embedded. Your application shreds content.text from time to time to find and use these links. This is expensive, so as a performance optimisation you would like to store the links separately in content_link.link_id.
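For concreteness, the discussion below assumes the two tables look roughly like this. Only content.text and content_link.link_id are named in the question; every other name and type is a guess:

CREATE TABLE content
(
    content_id INT           NOT NULL PRIMARY KEY,
    [text]     NVARCHAR(MAX) NOT NULL  -- structured text with embedded links
);

CREATE TABLE content_link
(
    content_id INT          NOT NULL
        FOREIGN KEY REFERENCES content (content_id),
    link_id    VARCHAR(400) NOT NULL,  -- one link shredded out of content.text
    PRIMARY KEY (content_id, link_id)
);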
This is where the update anomalies can creep in. If content is updated so that content.text has a different set of links, but content_link is not also updated, then there is inconsistent data in the database. It works the other way, too - a row can be deleted from content_link without content.text changing and, again, the data is inconsistent. This is why the application has to be diligently coded and tested.
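As a hypothetical illustration of the first anomaly, using the assumed schema above:

-- The text is rewritten, but the shredded links are not refreshed,
-- so content and content_link now disagree (values are made up)
UPDATE content
SET    [text] = N'new body linking to /pages/42 instead of /pages/7'
WHERE  content_id = 1;
-- Missing here: a matching DELETE / INSERT against content_link,
-- which still holds the links parsed from the old text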
There is another rare-but-possible case to consider, too. Changing the data will require two statements - update content and update content_link. RDBMSs use pre-emptive multitasking, which means that at any time the RDBMS may choose to halt one workload and run another instead. It could, for example, halt the updating stream after the first update statement and before the second. If there is another workload trying to read content or content_link at that point, it may run and get inconsistent values, if the reading transaction's isolation level allows it to do so. As I said, rare but possible.
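A common mitigation, sketched here against the assumed schema, is to issue both statements inside one transaction, so that readers at the default isolation level or stricter never see the half-updated state (readers at READ UNCOMMITTED still could):

BEGIN TRANSACTION;

UPDATE content
SET    [text] = N'new body linking to /pages/42'
WHERE  content_id = 1;

DELETE FROM content_link
WHERE  content_id = 1;

INSERT INTO content_link (content_id, link_id)
VALUES (1, '/pages/42');  -- the links re-parsed from the new text

COMMIT TRANSACTION;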
If these problems are acceptably unlikely in your application, and the implications of them happening are not severe enough to worry about, then denormalisation can deliver performance advantages. If the risk of data inconsistency is too great then you'll have to find other solutions.
Best Answer
As others have mentioned, this can be done rather easily via a SubClass / Inheritance model. I would advocate something closer to what Pieter Geerkens suggested than what VDohnal suggested, as I prefer not having lots of NULL fields per row depending on the "type" that the row represents (i.e. the [Value] field has to be NULL on an X row because X does not have that property; Y does have it, but does not have the [attr] field, so [attr] has to be NULL on rows representing Y objects). I just feel that is harder to maintain over time, as someone needs to be VERY familiar with the app-layer object model to know whether a NULL field is a valid data point for that row or is to be ignored for that row. Even if you have CHECK constraints to enforce the NULLs, that is yet something else to maintain as you add properties to objects over time, and it is still not easily visible when writing queries against these tables. Please note that, in the past, the issue would also have been a needless waste of space related to fixed-length data types at the data layer, but that has been mostly alleviated by the SPARSE option which was introduced in SQL Server 2008 (well, assuming you are using SQL Server).
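For reference, SPARSE is just a column modifier. A minimal sketch of the single-table shape it helps with, borrowing the [attr] and [Value] names from the discussion above (the table name is made up):

-- Single-table variant with SPARSE columns: NULLs in the
-- subtype-specific fields cost (almost) no storage
CREATE TABLE ThingSingleTable
(
    Thing_ID INT         NOT NULL PRIMARY KEY,
    [attr]   VARCHAR(50) SPARSE NULL,  -- only meaningful for X rows
    [Value]  INT         SPARSE NULL   -- only meaningful for Y rows
);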
Example SQL using Microsoft T-SQL syntax (but no features highly-specific to SQL Server):
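A minimal sketch of what that schema could look like - only Thing_ID, [attr], and [Value] come from the surrounding discussion; the other names are assumptions:

-- Base table: one row per object, regardless of subtype
CREATE TABLE Thing
(
    Thing_ID INT          NOT NULL PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL  -- assumed common property
);

-- Subclass tables: one row per object of that subtype; the PK is also
-- a FK back to the base table, so a subtype row cannot exist without
-- its base row
CREATE TABLE ThingX
(
    Thing_ID INT         NOT NULL PRIMARY KEY
        FOREIGN KEY REFERENCES Thing (Thing_ID),
    [attr]   VARCHAR(50) NOT NULL  -- X-specific property (from the discussion)
);

CREATE TABLE ThingY
(
    Thing_ID INT NOT NULL PRIMARY KEY
        FOREIGN KEY REFERENCES Thing (Thing_ID),
    [Value]  INT NOT NULL  -- Y-specific property (from the discussion)
);

CREATE TABLE ThingZ
(
    Thing_ID INT NOT NULL PRIMARY KEY
        FOREIGN KEY REFERENCES Thing (Thing_ID)
    -- Z-specific properties would go here
);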
Side Notes:
Just to be clear about differences between this suggestion and Pieter's (since they are very similar):
The relationship in the question (class A having List<Thing>) is such that both of the other answers have misconstrued it: VDohnal's has the relationship backwards, such that a single Thing will have multiple A objects (this has since been corrected). On a practical level, I cannot see a case for doing the LEFT JOIN query at the bottom of Pieter's answer, but maybe on occasion there is. In such a case I have opted instead to query each type individually via INNER JOIN and then combine via UNION ALL (fields in one table that aren't in the other subclass table(s) need to be given literal values):
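A sketch of that query, using the assumed columns from the schema above (the Z branch would follow the same pattern):

-- Each subtype is fetched with an INNER JOIN, then the results are
-- combined; columns missing from a subtype are filled with literals
SELECT base.Thing_ID,
       base.Name,
       'X'  AS ThingType,
       sub.[attr],
       NULL AS [Value]  -- X has no [Value]
FROM   Thing base
INNER JOIN ThingX sub
        ON base.Thing_ID = sub.Thing_ID

UNION ALL

SELECT base.Thing_ID,
       base.Name,
       'Y'  AS ThingType,
       NULL AS [attr],  -- Y has no [attr]
       sub.[Value]
FROM   Thing base
INNER JOIN ThingY sub
        ON base.Thing_ID = sub.Thing_ID;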
EDIT:
Here are two more notes that I thought of:
You can create a View per type to make querying the full objects easier:

CREATE VIEW [X] AS
SELECT base.Thing_ID, base.Name, sub.[attr]  -- columns listed explicitly; SELECT * would duplicate Thing_ID
FROM   [Thing] base
INNER JOIN [ThingX] sub
        ON base.Thing_ID = sub.Thing_ID;
and so on for the Y and Z objects.
EDIT 2:
Renamed the [X], [Y], and [Z] tables to be [ThingX], [ThingY], and [ThingZ] respectively. This naming convention for the subclass tables is more representative of them being extensions and not complete objects / entities by themselves. This should also make the model (and working with the tables) conceptually clearer in that [X] (the View suggested directly above) = [Thing] + [ThingX], and so on with [Y] and [Z].
Also, the tables will be grouped together in SSMS, in queries against sys.tables / sys.objects / INFORMATION_SCHEMA.TABLES, and hopefully in your source control system where you have individual SQL scripts for each table :-).