Why is data integrity not possible when using denormalization?

database-design denormalization

This tutorial says the following:

For proponents of denormalization, the thinking is as follows: normalization creates more tables as you proceed towards higher normal forms, but more tables mean there are more joins to be made when data is retrieved, which in turn slows down your queries. For that reason, to improve the performance of certain queries, you can override the advantages to data integrity and return the data structure to a lower normal form.

Why is data integrity not possible when using denormalization? I mean, wouldn't the DBMS make sure that all redundant copies of some piece of data get updated automatically when that piece of data is updated?

In Oracle, for example, denormalization can be implemented using materialized views: when the original tables are modified, the data in the materialized view is updated automatically (you can also choose not to have it refreshed automatically). And even if you don't use materialized views (if your DBMS doesn't support them, for example) and implement the denormalization manually, you can still write triggers yourself to keep the redundant copies consistent. So using denormalization does not necessarily mean that data integrity is impossible.
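For instance, a denormalized read structure that Oracle keeps in sync by itself might look roughly like this. The table and column names are made up for illustration, and a fast-refresh-on-commit join view has further requirements (materialized view logs, rowids in the select list) that are only sketched here:

    -- Illustrative only: a join materialized view that Oracle refreshes on commit.
    CREATE MATERIALIZED VIEW LOG ON employees WITH ROWID;
    CREATE MATERIALIZED VIEW LOG ON contacts  WITH ROWID;

    CREATE MATERIALIZED VIEW employee_contacts_mv
      REFRESH FAST ON COMMIT
    AS
    SELECT e.rowid AS emp_rid, c.rowid AS con_rid,
           e.employee_id, e.last_name, c.phone, c.address
    FROM   employees e, contacts c
    WHERE  c.employee_id = e.employee_id;

Alternatively, you could keep a plain denormalized table and maintain the redundant columns yourself with triggers on the base tables.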

Best Answer

Data integrity is possible in both normalized and denormalized databases. What is different in the two cases is the number of different ways an application program can write bad data to the database.

Let's take a simple case of bad data. In the database, it says that employee with id 123 has last name "Brown". But in the real world, employee 123 has last name "Browne". This is bad data no matter how you look at it. But let's go a little deeper.

In one database, a normalized one, the last name is stored in precisely one place, in the Employees table. It's either right or it's wrong. And, if it's wrong, it's up to someone who knows the right answer to fix the data in the database. This is part of data management, although not a very exciting part to those of us who work in automation.

In another database, a denormalized one, the last name of employee 123 is stored in two tables, the Employees table and the Contacts table. The Contacts table is about phone numbers and addresses. In the normalized database, there is no Contacts table; it is a view instead. Now, there are more possibilities. The last name could be right in both places, or it could be wrong in both places. But there is another failure mode: it could be right in one place but wrong in the other. Now the database is not only wrong, but self-contradictory. This is simply not possible in the normalized database, because the last name is stored only once.
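To make that concrete, here is a rough sketch of the two designs. The names are illustrative, and the denormalized table is suffixed _denorm only so both versions can appear in one sketch:

    -- Normalized: the last name lives only in Employees; Contacts is a view.
    CREATE TABLE employees (
      employee_id INTEGER PRIMARY KEY,
      last_name   VARCHAR(50) NOT NULL
    );

    CREATE TABLE phones (
      employee_id INTEGER NOT NULL REFERENCES employees (employee_id),
      phone       VARCHAR(20),
      address     VARCHAR(200)
    );

    CREATE VIEW contacts AS
    SELECT e.employee_id, e.last_name, p.phone, p.address
    FROM   employees e
    JOIN   phones p ON p.employee_id = e.employee_id;

    -- Denormalized: Contacts is a real table that repeats last_name,
    -- so the two copies are free to disagree.
    CREATE TABLE contacts_denorm (
      employee_id INTEGER NOT NULL REFERENCES employees (employee_id),
      last_name   VARCHAR(50),   -- redundant copy of employees.last_name
      phone       VARCHAR(20),
      address     VARCHAR(200)
    );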

This is an example of what is called harmful redundancy. Second Normal Form through Fifth Normal Form are about eliminating cases of harmful redundancy by imposing rules that cause each fact to be stored in only one place. This doesn't guarantee data integrity, but at least it rules out this kind of inconsistency. You still have to manage data to get it all right, but now the DBMS is helping you manage the data correctly.

Performance is a whole other story. It may be the case that one of the databases outperforms the other for certain kinds of traffic. It's even possible that one of the databases is faster for updates while the other is faster for queries. But great performance combined with wrong answers may be catastrophic in some cases and acceptable in others.

In the example I've given, the detection of wrong data is fairly trivial, even if it's somewhat tedious. In the real world, there are examples where wrong data detection is well nigh impossible unless you get all the help you can from the DBMS. Normalization can help in this regard.
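In the denormalized design sketched above, for example, the self-contradictory rows are at least easy to find with a query like this (again, the names are illustrative):

    -- Find employees whose two last_name copies disagree.
    SELECT e.employee_id,
           e.last_name AS last_name_in_employees,
           c.last_name AS last_name_in_contacts
    FROM   employees e
    JOIN   contacts_denorm c ON c.employee_id = e.employee_id
    WHERE  e.last_name <> c.last_name;

What the query cannot tell you is which of the two values is the right one; someone who knows the real world still has to decide that.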

Well, why don't you just make sure that all the application code always does the right thing? In the first place, the amount of application code grows dramatically as the scope of the mission increases. In the second place, data errors can creep in through the use of interactive SQL, perhaps by the DBA. Writing defensive application code is a good thing, but for one reason or another, it might not get done.

Sorry this is so long, but I needed to give you an idea about what's really going on here. Data management is a never-ending battle. Finding and cleaning up wrong data is just part of that battle. It's a dirty job, but somebody has to do it. A good DBMS, coupled with normalization, can help.