Database Design – Sacrificing Normalization for Data Integrity

database-design · denormalization · foreign-key · normalization · table

I often come up against the following challenge, and end up de-normalizing my table design in order to enforce data integrity. I would be interested to hear if there is a different way of modelling the data that is both normalized and enforces integrity.

Here is a typical (simplified) example:

create table [ProductType]
(
    [ProductTypeId] INT identity(1,1) not null,
    [ProductTypeName] nvarchar(100) not null, 
    constraint [PK_ProductType] primary key ([ProductTypeId])
)

create table [Product]
(
    [ProductId] INT identity(1,1) not null,
    [ProductTypeId] int not null,
    [ProductName] nvarchar(100) not null,
    constraint [PK_Product] primary key ([ProductId]), 
    constraint [AK_Product] unique ([ProductId], [ProductTypeId]),
    constraint [FK_Product_ProductType] foreign key ([ProductTypeId]) references [ProductType]([ProductTypeId])
)

create table [ProductTypeProperty]
(
    [PropertyId] INT identity(1,1) not null,
    [ProductTypeId] int not null,
    [PropertyName] nvarchar(100) not null,
    constraint [PK_ProductTypeProperty] primary key ([PropertyId]), 
    constraint [AK_ProductTypeProperty] unique ([PropertyId], [ProductTypeId]),
    constraint [FK_ProductTypeProperty_ProductType] foreign key ([ProductTypeId]) references [ProductType]([ProductTypeId])
)

create table [ProductPropertyValue]
(
    [ProductId] INT not null,
    [PropertyId] INT not null,
    [ProductTypeId] int not null,
    [PropertyValue] nvarchar(100) not null,
    constraint [PK_ProductPropertyValue] primary key ([ProductId], [PropertyId]),
    constraint [FK_ProductPropertyValue_ProductTypeProperty] foreign key ([PropertyId], [ProductTypeId]) references [ProductTypeProperty]([PropertyId], [ProductTypeId]), 
    constraint [FK_ProductPropertyValue_Product] foreign key ([ProductId], [ProductTypeId]) references [Product]([ProductId], [ProductTypeId])
)

SET IDENTITY_INSERT [dbo].[ProductType] ON
INSERT [dbo].[ProductType] ([ProductTypeId], [ProductTypeName]) VALUES (1, N'Clothing')
INSERT [dbo].[ProductType] ([ProductTypeId], [ProductTypeName]) VALUES (2, N'Drink')
SET IDENTITY_INSERT [dbo].[ProductType] OFF

SET IDENTITY_INSERT [dbo].[ProductTypeProperty] ON
INSERT [dbo].[ProductTypeProperty] ([PropertyId], [ProductTypeId], [PropertyName]) VALUES (1, 1, N'Colour')
INSERT [dbo].[ProductTypeProperty] ([PropertyId], [ProductTypeId], [PropertyName]) VALUES (2, 1, N'Size')
INSERT [dbo].[ProductTypeProperty] ([PropertyId], [ProductTypeId], [PropertyName]) VALUES (3, 2, N'Volume')
SET IDENTITY_INSERT [dbo].[ProductTypeProperty] OFF

SET IDENTITY_INSERT [dbo].[Product] ON
INSERT [dbo].[Product] ([ProductId], [ProductTypeId], [ProductName]) VALUES (1, 1, N'T-shirt')
INSERT [dbo].[Product] ([ProductId], [ProductTypeId], [ProductName]) VALUES (2, 2, N'Milk')
SET IDENTITY_INSERT [dbo].[Product] OFF

INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (1, 1, 1, N'Red')
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (1, 2, 1, N'XL')
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (2, 3, 2, N'1 pint')

-- NOTE THAT THE FKS ON [PRODUCTPROPERTYVALUE] MEAN YOU CANNOT RUN EITHER OF THESE
-- WHICH TRY TO ASSIGN A PROPERTY TO A PRODUCT THAT DOESN'T BELONG TO ITS TYPE
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (1, 3, 1, N'Red')
INSERT [dbo].[ProductPropertyValue] ([ProductId], [PropertyId], [ProductTypeId], [PropertyValue]) VALUES (2, 2, 1, N'XL')

Here's what I am trying to model:

  1. A product has a product type (e.g. "clothing", "drink" etc)
  2. A product type has a number of properties (e.g. "clothing" has "colour" and "size")
  3. A product has property values for any/all of the properties belonging to its type
  4. A product cannot have property values for properties not belonging to its type

Point #4 is the one that causes the challenge. In order to create foreign keys to achieve this ([FK_ProductPropertyValue_ProductTypeProperty] and [FK_ProductPropertyValue_Product]) I have de-normalized and added "unnecessary" unique constraints:

  1. Added [ProductTypeId] to [ProductPropertyValue]
  2. Added a unique constraint on [PropertyId], [ProductTypeId] to [ProductTypeProperty]
  3. Added a unique constraint on [ProductId], [ProductTypeId] to [Product]

However it seems impossible to achieve all 4 of the above points without these … is it?

This challenge seems to come up when my design departs from a simple branching "snowflake" and there needs to be "loops" in the relationships:

P ---------> PT
^            ^
|            |
|            |
PPV ------> PTP

Note: this is a simple example with four tables. In a real-world design with many levels of related tables, the issue is magnified (unique constraints over multiple columns, more than one "unnecessary" unique constraint per table, etc.).

Best Answer

I often come up against the following challenge, and end up de-normalizing my table design in order to enforce data integrity.

This sounds like a contradiction: it is normalization that usually enforces integrity. And in my opinion you have a misunderstanding: you did not de-normalize your design. It is normalized just fine.

I would be interested to hear if there is a different way of modelling the data that is both normalized and enforces integrity.

The specific issue - I call it the "diamond shape" - appears quite often, although many do not realize it or do not see it as a problem, and do not enforce the constraint that you have enforced with your design.

And I don't think there is any other way, minor details aside. A unique key of the "grandparent" table (ProductType in this case) has to be part of the unique constraints of the "parent" tables (Product and ProductTypeProperty) and has to appear in the "child" table (ProductPropertyValue), exactly as you did.

Point #4 is the one that causes the challenge. In order to create foreign keys to achieve this ([FK_ProductPropertyValue_ProductTypeProperty] and [FK_ProductPropertyValue_Product]) I have de-normalized and added "unnecessary" unique constraints:

I disagree about the "unnecessary". They are needed, first of all so that the FOREIGN KEY constraints can be defined and do their work.

Your unique constraints are just fine and are needed for exactly the integrity reasons you described: to ensure that a ProductPropertyValue is related to a Product and a ProductTypeProperty that both belong to the same ProductType.

The redundant parts of the design are the primary key constraints on the identity columns of Product and ProductTypeProperty. They could be removed without any loss of integrity. I know it's more than common for an automatically incrementing column (identity in SQL Server) to also have a UNIQUE or PRIMARY KEY constraint, but it doesn't strictly need one. It may be good to keep that constraint for performance reasons, though. You may also want to use the column as a foreign key in some other table where there is no connection with product types (and thus the ProductTypeId is unnecessary). But from a logical design and normalization point of view, there is no normalization issue. It's just an implementation detail, a matter of physical design.
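As a minimal sketch of that point (assuming the rest of the schema from the question is unchanged), [Product] could drop the single-column primary key on the identity column and promote the composite key instead:

```sql
-- Sketch only: a [Product] variant without a separate PK on the identity
-- column. The composite key (ProductId, ProductTypeId) becomes the primary
-- key, and it is all the FK from [ProductPropertyValue] needs.
create table [Product]
(
    [ProductId] INT identity(1,1) not null,
    [ProductTypeId] int not null,
    [ProductName] nvarchar(100) not null,
    constraint [PK_Product] primary key ([ProductId], [ProductTypeId]),
    constraint [FK_Product_ProductType] foreign key ([ProductTypeId])
        references [ProductType]([ProductTypeId])
)
```

Note that ProductId alone is then no longer declared unique, which only matters if some other table needs to reference [Product] by ProductId alone.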

I used the phrase "de-normalizing my table design" to mean "adding columns to tables that are already in other related tables" (i.e. the value for a given row can be derived from a related row). Isn't one aspect of normalisation the removal of redundant columns?

Yes, it is. But you have to start from entities, attributes and functional dependencies. When you add surrogate keys before doing that, you are tempted to use them everywhere as foreign key columns, since the DBMS provides them with the extra feature of uniqueness. So you may miss some other unique column or combination. For example, (ProductTypeId, ProductName) might be able to identify products as well, so it could have a unique constraint; I don't know the details of your business rules. If it did, it could be used instead as the foreign key in ProductPropertyValue. And similarly for ProductTypeProperty.
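If such a business rule did hold, the natural-key variant might be sketched like this. (Hypothetical: it assumes product names are unique within a product type, which the question does not state.)

```sql
-- Hypothetical alternative: a natural key on [Product].
-- Assumes ProductName is unique within a ProductType, a business rule
-- not given in the question.
alter table [Product]
    add constraint [AK_Product_Name] unique ([ProductTypeId], [ProductName]);

-- [ProductPropertyValue] could then carry [ProductName] instead of
-- [ProductId] and reference this natural key:
--   foreign key ([ProductTypeId], [ProductName])
--       references [Product]([ProductTypeId], [ProductName])
```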

Whether that would be more efficient is another matter, since long columns like nvarchar(100) are not the best fit for indexes. So even if you had that unique constraint, you might end up using what you chose in the first place, for efficiency reasons. (Consider 4+4+4 = 12 bytes vs 4+200+200 = 404 bytes if we used the two nvarchar columns from Product and ProductTypeProperty.) Choosing between a 12-byte and a 404-byte index key is not a hard choice; I'd expect everyone to go with the 12-byte key, even if ProductPropertyValue is then technically not in 3NF.

And I say "technically", and stress again that the only reason ProductPropertyValue looks as if it is not in 3NF is that the DBMS-provided identity columns (ProductId and PropertyId in Product and ProductTypeProperty) have the nice extra feature of uniqueness. (Which actually isn't guaranteed in SQL Server: you could set IDENTITY_INSERT ON and push duplicate values into an identity column, and other DBMS have similar settings. Uniqueness is only guaranteed by the unique constraints you have in place.)
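To illustrate that last point, a throwaway table (hypothetical, not part of the schema above) shows that in SQL Server an identity column without a unique constraint happily accepts duplicates:

```sql
-- Illustration: identity alone does not guarantee uniqueness.
-- [IdentityDemo] is a hypothetical throwaway table with no PK or UNIQUE
-- constraint on its identity column.
create table [dbo].[IdentityDemo]
(
    [Id] INT identity(1,1) not null
);

SET IDENTITY_INSERT [dbo].[IdentityDemo] ON;
INSERT [dbo].[IdentityDemo] ([Id]) VALUES (1);
INSERT [dbo].[IdentityDemo] ([Id]) VALUES (1);  -- duplicate is accepted
SET IDENTITY_INSERT [dbo].[IdentityDemo] OFF;
```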

This challenge seems to come up when my design departs from a simple branching "snowflake" and there needs to be "loops" in the relationships:

P ---------> PT
^            ^
|            |
|            |
PPV ------> PTP

That's why I call it "diamond shape"!

And please note that there is no loop there!
- Just start anywhere you like and follow the arrows. Can you return to where you started?
- No. So there are no loops.

Note: this is a simple example with four tables. In a real-world design with many levels of related tables, the issue is magnified (unique constraints over multiple columns, more than one "unnecessary" unique constraint per table, etc.).

Well, that is a problem. When the rules are complicated and involve many entities on multiple levels, the result is a complicated model too, with multi-column unique constraints and multi-column foreign keys. This may have performance implications as well, since many DBMS do not handle multi-column constraints very well (e.g. the indexes are wider, the optimizer may not perform as well or may not recognize multi-column foreign keys at all and so miss query-rewriting opportunities, etc.). And let's not even talk about ORMs, most of which expect a single column named id as the primary key in every table ...

There is not much you can do about it. You either enforce the constraints with an often complex model, or remove some constraints from the design and simplify it.

You can also test several designs with various levels of missing rules, balancing performance, ease of use and development, and enforcement of constraints. But at least you'll know exactly where you simplified and which constraints are not enforced at the database level.
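For constraints you decide not to enforce declaratively, a periodic query can at least detect violations. For example, if the composite foreign keys on [ProductPropertyValue] were dropped and it kept only ProductId and PropertyId, a check like this (a sketch against the schema above) would find rows breaking rule #4:

```sql
-- Sketch: find ProductPropertyValue rows whose property does not belong
-- to the product's type. Returns no rows when rule #4 holds.
select ppv.[ProductId], ppv.[PropertyId]
from [ProductPropertyValue] ppv
join [Product] p
    on p.[ProductId] = ppv.[ProductId]
join [ProductTypeProperty] ptp
    on ptp.[PropertyId] = ppv.[PropertyId]
where p.[ProductTypeId] <> ptp.[ProductTypeId];
```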


See a similar problem discussed in another question, Best data modelling approach to handle redundant foreign keys in relational model, where @MDCCL has a great, detailed answer suggesting the same solution as your design. The similarity is obvious just from the graphs:

        ProductType                      Survey
           /   \                          /   \
          /     \                        /     \
         /       \                      /       \          
    Product     Property       PersonSurvey   QuestionSurvey
         \       /                      \       /
          \     /                        \     /
           \   /                          \   /
    ProductPropertyValue                 Response

The details match as well, if we check the columns and constraints of both problems. The difference is that MDCCL started from a logical design, so you won't see identity columns or other such implementation details; those should be considered in the next phase (physical design), after the logical design and normalization are complete.