Database Design – Sacrificing Normalization for Data Integrity

database-design · denormalization · foreign-key · normalization · table

I often come up against the following challenge, and end up de-normalizing my table design in order to enforce data integrity. I would be interested to hear if there is a different way of modelling the data that is both normalized and enforces integrity.

Here is a typical (simplified) example:

create table [ProductType]
(
    [ProductTypeId] INT identity(1,1) not null,
    [ProductTypeName] nvarchar(100) not null, 
    constraint [PK_ProductType] primary key ([ProductTypeId])
)

create table [Product]
(
    [ProductId] INT identity(1,1) not null,
    [ProductTypeId] int not null,
    [ProductName] nvarchar(100) not null,
    constraint [PK_Product] primary key ([ProductId]), 
    constraint [AK_Product] unique ([ProductId], [ProductTypeId]),
    constraint [FK_Product_ProductType] foreign key ([ProductTypeId]) references [ProductType]([ProductTypeId])
)

create table [ProductTypeProperty]
(
    [PropertyId] INT identity(1,1) not null,
    [ProductTypeId] int not null,
    [PropertyName] nvarchar(100) not null,
    constraint [PK_ProductTypeProperty] primary key ([PropertyId]), 
    constraint [AK_ProductTypeProperty] unique ([PropertyId], [ProductTypeId]),
    constraint [FK_ProductTypeProperty_ProductType] foreign key ([ProductTypeId]) references [ProductType]([ProductTypeId])
)

create table [ProductPropertyValue]
(
    [ProductId] INT not null,
    [PropertyId] INT not null,
    [ProductTypeId] int not null,
    [PropertyValue] nvarchar(100) not null,
    constraint [PK_ProductPropertyValue] primary key ([ProductId], [PropertyId]),
    constraint [FK_ProductPropertyValue_ProductTypeProperty] foreign key ([PropertyId], [ProductTypeId]) references [ProductTypeProperty]([PropertyId], [ProductTypeId]), 
    constraint [FK_ProductPropertyValue_Product] foreign key ([ProductId], [ProductTypeId]) references [Product]([ProductId], [ProductTypeId])
)

SET IDENTITY_INSERT [dbo].[ProductType] ON
INSERT [dbo].[ProductType] ([ProductTypeId], [ProductTypeName]) VALUES (1, N'Clothing')
INSERT [dbo].[ProductType] ([ProductTypeId], [ProductTypeName]) VALUES (2, N'Drink')
SET IDENTITY_INSERT [dbo].[ProductType] OFF

SET IDENTITY_INSERT [dbo].[ProductTypeProperty] ON
INSERT [dbo].[ProductTypeProperty] ([PropertyId], [ProductTypeId], [PropertyName]) VALUES (1, 1, N'Colour')
INSERT [dbo].[ProductTypeProperty] ([PropertyId], [ProductTypeId], [PropertyName]) VALUES (2, 1, N'Size')
INSERT [dbo].[ProductTypeProperty] ([PropertyId], [ProductTypeId], [PropertyName]) VALUES (3, 2, N'Volume')
SET IDENTITY_INSERT [dbo].[ProductTypeProperty] OFF

SET IDENTITY_INSERT [dbo].[Product] ON
INSERT [dbo].[Product] ([ProductId], [ProductTypeId], [ProductName]) VALUES (1, 1, N'T-shirt')
INSERT [dbo].[Product] ([ProductId], [ProductTypeId], [ProductName]) VALUES (2, 2, N'Milk')
SET IDENTITY_INSERT [dbo].[Product] OFF

INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (1, 1, 1, N'Red')
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (1, 2, 1, N'XL')
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (2, 3, 2, N'1 pint')

-- NOTE THAT THE FKS ON [PRODUCTPROPERTYVALUE] MEAN YOU CANNOT RUN EITHER OF THESE
-- WHICH TRY TO ASSIGN A PROPERTY TO A PRODUCT THAT DOESN'T BELONG TO ITS TYPE
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (1, 3, 1, N'Red')
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (2, 2, 1, N'XL')

Here's what I am trying to model:

  1. A product has a product type (e.g. "clothing", "drink" etc)
  2. A product type has a number of properties (e.g. "clothing" has "colour" and "size")
  3. A product has property values for any/all of the properties belonging to its type
  4. A product cannot have property values for properties not belonging to its type

Point #4 is the one that causes the challenge. In order to create foreign keys to achieve this ([FK_ProductPropertyValue_ProductTypeProperty] and [FK_ProductPropertyValue_Product]) I have de-normalized and added "unnecessary" unique constraints:

  1. Added [ProductTypeId] to [ProductPropertyValue]
  2. Added a unique constraint on [PropertyId], [ProductTypeId] to [ProductTypeProperty]
  3. Added a unique constraint on [ProductId], [ProductTypeId] to [Product]

However it seems impossible to achieve all 4 of the above points without these … is it?

This challenge seems to come up when my design departs from a simple branching "snowflake" and there needs to be "loops" in the relationships:

P ---------> PT
^            ^
|            |
|            |
PPV ------> PTP

Note: this is a simple example with four tables. In a real-world design with many levels of related tables, the issue is magnified (unique constraints over multiple columns, more than one "unnecessary" unique constraint per table, etc.).

Best Answer

I often come up against the following challenge, and end up de-normalizing my table design in order to enforce data integrity.

This sounds like a contradiction: it is normalization that usually enforces integrity. And in my opinion you have a misunderstanding: you did not de-normalize your design. It is normalized just fine.

I would be interested to hear if there is a different way of modelling the data that is both normalized and enforces integrity.

The specific issue - I call it the "diamond shape" - appears quite often, although many do not realize it or do not see it as a problem, and do not enforce the constraint that you have enforced with your design.

And I don't think there is any other way, minor details aside. A unique key of the "grandparent" table (ProductType in this case) has to be part of the unique constraints of the "parent" tables (Product and ProductTypeProperty) and has to appear in the "child" table (ProductPropertyValue), exactly as you did.

Point #4 is the one that causes the challenge. In order to create foreign keys to achieve this ([FK_ProductPropertyValue_ProductTypeProperty] and [FK_ProductPropertyValue_Product]) I have de-normalized and added "unnecessary" unique constraints:

I disagree about the "unnecessary". They are needed, first of all so that the FOREIGN KEY constraints can be defined and do their work.

Your unique constraints are just fine and are needed for exactly the integrity reasons you described: to ensure that a ProductPropertyValue is related to a Product and a ProductTypeProperty that both belong to the same ProductType.

The redundant parts of the design are the primary key constraints on the identity columns of Product and ProductTypeProperty. They could be removed without any loss of integrity. I know it's more than common for an automatically incrementing column (identity in SQL Server) to also have a UNIQUE or PRIMARY KEY constraint, but it doesn't strictly need one. It may be good to keep that constraint for performance reasons, though. You may also want to use the column as a foreign key in some other table where there is no connection with product types (and thus the ProductTypeId is unnecessary). But from a logical design and normalization point of view, there is no normalization issue. It's just an implementation detail, a matter of physical design.
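As a minimal sketch of that point (assuming the rest of the schema from the question is unchanged), [Product] could drop the single-column primary key on the identity column and promote the composite key instead:

```sql
-- Sketch only: a [Product] variant without a separate PK on the identity
-- column. The composite key (ProductId, ProductTypeId) becomes the primary
-- key, and it is all the FK from [ProductPropertyValue] needs.
create table [Product]
(
    [ProductId] INT identity(1,1) not null,
    [ProductTypeId] int not null,
    [ProductName] nvarchar(100) not null,
    constraint [PK_Product] primary key ([ProductId], [ProductTypeId]),
    constraint [FK_Product_ProductType] foreign key ([ProductTypeId])
        references [ProductType]([ProductTypeId])
)
```

Note that ProductId alone is then no longer declared unique, which only matters if some other table needs to reference [Product] by ProductId alone.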

I used the phrase "de-normalizing my table design" to mean "adding columns to tables that are already in other related tables" (i.e. the value for a given row can be derived from a related row). Isn't one aspect of normalisation the removal of redundant columns?

Yes, it is. But you have to start from entities, attributes and functional dependencies. When you add surrogate keys before doing that, you are tempted to use them everywhere as foreign key columns, since the DBMS provides them with the extra feature of uniqueness. So you may miss some other unique column or combination. For example, (ProductTypeId, ProductName) might be able to identify products as well, so it could have a unique constraint; I don't know the details of your business rules. If it did, it could be used instead as the foreign key in ProductPropertyValue. And similarly for ProductTypeProperty.
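If such a business rule did hold, the natural-key variant might be sketched like this. (Hypothetical: it assumes product names are unique within a product type, which the question does not state.)

```sql
-- Hypothetical alternative: a natural key on [Product].
-- Assumes ProductName is unique within a ProductType, a business rule
-- not given in the question.
alter table [Product]
    add constraint [AK_Product_Name] unique ([ProductTypeId], [ProductName]);

-- [ProductPropertyValue] could then carry [ProductName] instead of
-- [ProductId] and reference this natural key:
--   foreign key ([ProductTypeId], [ProductName])
--       references [Product]([ProductTypeId], [ProductName])
```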

Whether that would be more efficient is another matter, since long columns like nvarchar(100) are not the best fit for indexes. So even if you had that unique constraint, you might end up using what you chose in the first place, for efficiency reasons. (Consider 4+4+4 = 12 bytes vs 4+200+200 = 404 bytes if we used the two nvarchar columns from Product and ProductTypeProperty.) Choosing between a 12-byte and a 404-byte index key is not a hard choice; I'd expect everyone to go with the 12-byte key, even if ProductPropertyValue is then technically not in 3NF.

And I say "technically", and stress again that the only reason ProductPropertyValue looks as if it is not in 3NF is that the DBMS-provided identity columns (ProductId and PropertyId in Product and ProductTypeProperty) have the nice extra feature of uniqueness. (Which actually isn't guaranteed in SQL Server: you could set IDENTITY_INSERT ON and push duplicate values into an identity column, and other DBMS have similar settings. Uniqueness is only guaranteed by the unique constraints you have in place.)
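To illustrate that last point, a throwaway table (hypothetical, not part of the schema above) shows that in SQL Server an identity column without a unique constraint happily accepts duplicates:

```sql
-- Illustration: identity alone does not guarantee uniqueness.
-- [IdentityDemo] is a hypothetical throwaway table with no PK or UNIQUE
-- constraint on its identity column.
create table [dbo].[IdentityDemo]
(
    [Id] INT identity(1,1) not null
);

SET IDENTITY_INSERT [dbo].[IdentityDemo] ON;
INSERT [dbo].[IdentityDemo] ([Id]) VALUES (1);
INSERT [dbo].[IdentityDemo] ([Id]) VALUES (1);  -- duplicate is accepted
SET IDENTITY_INSERT [dbo].[IdentityDemo] OFF;
```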

This challenge seems to come up when my design departs from a simple branching "snowflake" and there needs to be "loops" in the relationships:

P ---------> PT
^            ^
|            |
|            |
PPV ------> PTP

That's why I call it "diamond shape"!

And please note that there is no loop there!
- Just start anywhere you like and follow the arrows. Can you return to where you started?
- No. So there are no loops.

Note: this is a simple example with four tables. In a real-world design with many levels of related tables, the issue is magnified (unique constraints over multiple columns, more than one "unnecessary" unique constraint per table, etc.).

Well, that is a problem. When the rules are complicated and involve many entities on multiple levels, the result is a complicated model too, with multi-column unique constraints and multi-column foreign keys. This may have performance implications as well, since many DBMS do not handle multi-column constraints very well (e.g. the indexes are wider, the optimizer may not perform as well or may not recognize multi-column foreign keys at all and so miss query-rewriting opportunities, etc.). And let's not even talk about ORMs, most of which expect a single column named id as the primary key in every table ...

There is not much you can do about it. You either enforce the constraints with an often complex model, or remove some constraints from the design and simplify it.

You can also test several designs with various levels of missing rules, balancing performance, ease of use and development, and enforcement of constraints. But at least you'll know exactly where you simplified and which constraints are not enforced at the database level.
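For constraints you decide not to enforce declaratively, a periodic query can at least detect violations. For example, if the composite foreign keys on [ProductPropertyValue] were dropped and it kept only ProductId and PropertyId, a check like this (a sketch against the schema above) would find rows breaking rule #4:

```sql
-- Sketch: find ProductPropertyValue rows whose property does not belong
-- to the product's type. Returns no rows when rule #4 holds.
select ppv.[ProductId], ppv.[PropertyId]
from [ProductPropertyValue] ppv
join [Product] p
    on p.[ProductId] = ppv.[ProductId]
join [ProductTypeProperty] ptp
    on ptp.[PropertyId] = ppv.[PropertyId]
where p.[ProductTypeId] <> ptp.[ProductTypeId];
```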


See a similar problem discussed in another question, Best data modelling approach to handle redundant foreign keys in relational model, where @MDCCL has a great, detailed answer suggesting the same solution as your design. The similarity is obvious just from the graphs:

        ProductType                      Survey
           /   \                          /   \
          /     \                        /     \
         /       \                      /       \          
    Product     Property       PersonSurvey   QuestionSurvey
         \       /                      \       /
          \     /                        \     /
           \   /                          \   /
    ProductPropertyValue                 Response

The details match as well, if we check the columns and constraints of both problems. The difference is that MDCCL started from a logical design, so you won't see identity columns or other such implementation details; those should be considered in the next phase (physical design), after the logical design and normalization are complete.