Database Design – How to Know When to Stop Normalizing

database-design, normalization

Alright, so basically I feel that I tend to over-normalize things, and maybe I am doing it at the cost of performance. To illustrate the problem, I have created the following schema as an example:

[Schema diagram comparing the two approaches]

As you can see, I have outlined two different approaches. The idea here is that all universities have programs (e.g., Engineering), and all programs have majors (e.g., Electrical Engineering). For this example to work, assume that there are 40 programs and, say, 1,000 majors, and that all schools offer the same programs/majors.

Now, my typical approach in this scenario is to take anything that may be repeated (i.e., majors and programs) and put those items into their own tables, then relate them as modeled above. Another approach, one I tend to stay away from, is the second model, in which program and major are columns with repeated values (e.g., Engineering may be repeated 1,000s of times across the table). Basically, if a value is repeated, I create a table for it.
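To make the two models concrete, here is a minimal sketch in Python using sqlite3. All table and column names (university, program, major, university_major, university_major_flat) are placeholders of my own, since the original diagram is not reproduced here; the link from universities straight to majors anticipates the addendum below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Approach 1: repeated values get their own tables, tied together
-- with foreign keys (the normalized model).
CREATE TABLE university (
    university_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE
);

CREATE TABLE program (
    program_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE          -- e.g. 'Engineering'
);

CREATE TABLE major (
    major_id   INTEGER PRIMARY KEY,
    program_id INTEGER NOT NULL REFERENCES program(program_id),
    name       TEXT NOT NULL                 -- e.g. 'Electrical Engineering'
);

-- Universities are linked to individual majors, not whole programs
-- (the addendum below explains why).
CREATE TABLE university_major (
    university_id INTEGER NOT NULL REFERENCES university(university_id),
    major_id      INTEGER NOT NULL REFERENCES major(major_id),
    PRIMARY KEY (university_id, major_id)
);

-- Approach 2: one wide table in which program and major are plain text
-- columns, so 'Engineering' is repeated on every matching row.
CREATE TABLE university_major_flat (
    university TEXT NOT NULL,
    program    TEXT NOT NULL,
    major      TEXT NOT NULL,
    PRIMARY KEY (university, major)
);
""")
```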

Now, I'm not so much interested in which of these is the better approach, since I am only using them as an example to illuminate the real question: how does one know when they are over-normalizing? I know that there is a point at which you're going too far in normalizing your tables, but I never quite know how to measure it.

Addendum

A university need not offer all the majors in a program, which is why universities are tied to majors rather than programs (e.g., University X has an Engineering school but does not offer Nuclear Engineering, which is part of the Engineering program).

Best Answer

When designing a database, normalization is a process undertaken while producing the logical model, i.e., the relational data model. Performance is an attribute of the physical model: the implementation of the logical model on a particular database management system (DBMS).

You cannot "over-normalize" a relational data model. Either the relational data model is in fifth normal form or it is in a lower normal form and update anomalies exist.

The normalization process is a well-defined procedure for analyzing the functional, multivalued and join dependencies of the relations within the model and taking projections of these relations to eliminate any dependencies not implied by the candidate key.
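As an illustration of that procedure (a sketch, using the hypothetical flat table from the question as named above rather than anything from the original diagram): in the wide table the dependency major → program holds on only part of the candidate key (university, major), and taking projections eliminates it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE university_major_flat (
    university TEXT NOT NULL,
    program    TEXT NOT NULL,
    major      TEXT NOT NULL
)""")
conn.executemany(
    "INSERT INTO university_major_flat VALUES (?, ?, ?)",
    [
        ("University X", "Engineering", "Electrical Engineering"),
        ("University Y", "Engineering", "Electrical Engineering"),
        ("University Y", "Engineering", "Nuclear Engineering"),
    ],
)

# The dependency major -> program depends on only part of the candidate
# key (university, major).  Projecting the relation eliminates it:
major_program = conn.execute(
    "SELECT DISTINCT major, program FROM university_major_flat"
).fetchall()
university_major = conn.execute(
    "SELECT DISTINCT university, major FROM university_major_flat"
).fetchall()

print(major_program)     # each major's program is stated exactly once
print(university_major)  # which majors each university offers
```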

Assume you have a relational data model in fifth normal form that has been implemented on a particular DBMS, and at some point the performance of this implementation becomes unacceptable. How you decide to address this issue will depend on the DBMS being used and the features that DBMS makes available.

Having looked at the features available in the DBMS being used, you may decide to take some of the tables and refactor them so that they are in a lower normal form. (Note: the logical model has not changed; you have changed the physical model to address a performance issue with the implementation.) By refactoring these tables you have introduced the possibility of update anomalies, and, because your relational data model is in fifth normal form, you can determine precisely where they may occur. Having determined these, you may choose to ignore them, or add some additional processing to identify and/or address them. (Note: by adding additional processing you may impact performance such that the perceived benefit of "denormalizing" these tables is actually negated.) In any case, you are able to make an informed analysis of any action you take.
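For example (a sketch, again using the hypothetical names from the earlier snippets, not anything confirmed by the answer): if the normalized tables were refactored into one wide table for read performance, a partial update can now leave the repeated program name inconsistent, and the "additional processing" might simply be a query that flags exactly that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE university_major_denorm (
    university TEXT NOT NULL,
    program    TEXT NOT NULL,
    major      TEXT NOT NULL
)""")
conn.executemany(
    "INSERT INTO university_major_denorm VALUES (?, ?, ?)",
    [
        # A partial rename of the program has touched only one row;
        # this is the kind of update anomaly the lower normal form allows.
        ("University X", "Engineering", "Electrical Engineering"),
        ("University Y", "School of Engineering", "Electrical Engineering"),
    ],
)

# Additional processing: flag majors whose program name is inconsistent
# so the anomaly can be ignored, reported, or repaired.
anomalies = conn.execute("""
    SELECT major, COUNT(DISTINCT program) AS distinct_program_names
    FROM university_major_denorm
    GROUP BY major
    HAVING COUNT(DISTINCT program) > 1
""").fetchall()

print(anomalies)  # [('Electrical Engineering', 2)]
```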