What you describe in your second paragraph sounds like the Entity Attribute Value pattern.
This is a well-known pattern for dealing with the kind of data you are describing: when you need to be extensible, drive the types from the data, and start tracking new types as you grow. It lets you stay flexible and agile by being a little more forgiving and generic in your database design and schema.
This pattern works, even in relational databases, but it takes more work to get good performance and can cause headaches down the line depending on how many rows you have, how many inserts you are doing, and how frequently you query it. This SO question covers the pitfalls nicely in the first answer listed, the one with 20 votes. I have seen this model work well, especially in environments where new types and entities really do keep arriving and we wanted that to be application driven or user driven. We fought with performance more often in that model than in a traditional relational design, but we needed the flexibility and we made it work. Definitely look at that SO question and do some more research on EAV and on whatever DBMS you are using to find examples from others who have tried the same.
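For reference, the core of an EAV design is just two tables, something like this (a minimal sketch; all names here are illustrative, and storing every value as text is itself one of the classic pitfalls):
CREATE TABLE entity
( entity_id   INT         NOT NULL PRIMARY KEY
, entity_type VARCHAR(30) NOT NULL    -- the set of types lives in data, not DDL
) ;
CREATE TABLE entity_attribute
( entity_id  INT          NOT NULL REFERENCES entity (entity_id)
, attr_name  VARCHAR(30)  NOT NULL   -- attributes can be added without ALTER TABLE
, attr_value VARCHAR(255) NOT NULL   -- one generic value column for everything
, PRIMARY KEY (entity_id, attr_name)
) ;
Adding a new type or attribute is an INSERT rather than a schema change, which is where the flexibility comes from; the cost is weak typing and the self-joins needed to reassemble one logical row.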
With the restrictions you have supplied, and if you want to implement this in a DBMS, I think you could use a supertype/subtype pattern for the 5 (or more) types of objects, with only one additional table for the "object contains objects" list:
-- auxiliary table that has only 5 rows, one for each type
CREATE TABLE types
( level TINYINT NOT NULL PRIMARY KEY
, type_name VARCHAR(10) NOT NULL UNIQUE
) ;
INSERT INTO types (level, type_name)
VALUES (1,'A'), (2,'B'), (3,'C'), (4,'D'), (5,'E') ;
-- this is the supertype
CREATE TABLE objects
( object_id INT NOT NULL UNIQUE
, level TINYINT NOT NULL REFERENCES types (level)
, UNIQUE (level, object_id)
) ;
-- and the subtypes
CREATE TABLE a
( object_id INT NOT NULL PRIMARY KEY
, level TINYINT NOT NULL CHECK (level = 1)
--- various columns
, FOREIGN KEY (level, object_id)
REFERENCES objects (level, object_id)
) ;
--- and similarly for the other 4 tables
--- all referencing objects table
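Creating an object of type A then takes two inserts, first the supertype row and then the subtype row (a usage sketch; I am omitting a's various other columns):
-- create object 42 of type A (level 1)
INSERT INTO objects (object_id, level) VALUES (42, 1) ;
INSERT INTO a (object_id, level) VALUES (42, 1) ;
-- the composite foreign key (level, object_id) guarantees that a row in "a"
-- always matches a row in "objects" with level = 1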
And finally the list table, which is a simple many-to-many junction table with one additional constraint to ensure that an object of a higher level can contain only objects of lower levels:
CREATE TABLE contains
( container_level TINYINT NOT NULL
, container_id INT NOT NULL
, item_level TINYINT NOT NULL
, item_id INT NOT NULL
, PRIMARY KEY (container_id, item_id)
, FOREIGN KEY (container_level, container_id)
REFERENCES objects (level, object_id)
, FOREIGN KEY (item_level, item_id)
REFERENCES objects (level, object_id)
, CHECK (container_level > item_level)
) ;
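With the rows above, containment respects the hierarchy (a sketch, assuming objects 42 and 43 already exist at levels 1 and 2 respectively):
-- a level-2 object may contain a level-1 object ...
INSERT INTO contains (container_level, container_id, item_level, item_id)
VALUES (2, 43, 1, 42) ;
-- ... but not the other way round; this one is rejected by
-- CHECK (container_level > item_level)
INSERT INTO contains (container_level, container_id, item_level, item_id)
VALUES (1, 42, 2, 43) ;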
SQLite has CHECK constraints and has recently implemented recursive CTEs, if you want to get the results of an object and all the objects underneath it in one query.
From a quick search, H2 also supports recursive CTEs but Derby does not.
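Such a query could look roughly like this (a sketch; object 43 is a made-up root, and the exact syntax may vary slightly between these DBMSs):
WITH RECURSIVE tree (object_id, level) AS
( SELECT item_id, item_level
    FROM contains
   WHERE container_id = 43        -- the root object
  UNION ALL
  SELECT c.item_id, c.item_level
    FROM contains AS c
    JOIN tree AS t
      ON t.object_id = c.container_id
)
SELECT object_id, level
FROM tree ;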
Best Answer
Normalization is a set of techniques for avoiding certain problems, namely data update anomalies. In a perfect world you would not store structured content but split it into its constituent parts across however many tables are required and re-construct each complex object at runtime (1st normal form).
However, if pre-parsing the object and storing the interesting / frequently used bits somewhere else resolves a performance issue, then do this, as long as you do it with your eyes open. If you, and everyone else who may code against the database, understand that extra work must be done in the application in order to achieve the necessary response times, then that's a design choice, and necessary in the real world.
Response to comment:
My reading of your question is that content.text contains structured text in which links are embedded. Your application shreds content.text from time to time to find and use these links. This is expensive, and as a performance optimisation you would like to store the links separately in content_link.link_id.
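For concreteness, here is the shape I am imagining for the two tables; this is a guess from the column names in your question, not your actual schema:
CREATE TABLE content
( content_id INT  NOT NULL PRIMARY KEY
, text       TEXT NOT NULL    -- structured text with embedded links
) ;
CREATE TABLE content_link
( content_id INT NOT NULL REFERENCES content (content_id)
, link_id    INT NOT NULL    -- a link shredded out of content.text
, PRIMARY KEY (content_id, link_id)
) ;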
This is where the update anomalies can creep in. If content is updated so that content.text has a different set of links, but content_link is not also updated, then there is inconsistent data in the database. It works the other way, too - a row can be deleted from content_link without content.text changing and, again, the data is inconsistent. This is why the application has to be diligently coded and tested.
There is another rare-but-possible case to consider, too. Changing the data will require two statements - update content and update content_link. RDBMSs use pre-emptive multitasking. This means that at any time the RDBMS may choose to halt one workload and run another instead. So it could, for example, halt the updating stream after the first update statement and before the second. If there is another workload which is trying to read content or content_link at that point, it may run and get inconsistent values, if the reading transaction's isolation level allows it to do so. As I said, rare but possible.
If these problems are acceptably unlikely in your application, and the implications of them happening are not severe enough to worry about, then denormalisation can offer performance advantages. If the risk of data inconsistency is too great then you'll have to find other solutions.
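For the second, interleaving case, keeping the paired statements inside one transaction means that readers running at an isolation level which forbids dirty reads never see the halfway state (a sketch using the guessed schema above and made-up ids):
BEGIN TRANSACTION ;
UPDATE content
   SET text = '...new text with a new set of links...'
 WHERE content_id = 1 ;
DELETE FROM content_link
 WHERE content_id = 1 ;
INSERT INTO content_link (content_id, link_id)
VALUES (1, 101), (1, 102) ;   -- the links re-shredded from the new text
COMMIT ;
The transaction does nothing for the first anomaly, though - if the application simply forgets to update content_link, only diligent coding and testing will catch it.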