Tool to check if the database is normalized to the third normal form

database-design, database-recommendation, normalization, schema

I learned about normalization recently, and understand how important it is when implementing a new schema.

How can I check if my database is 2NF or 3NF compliant?

Manual review is a sure option, but I'm looking for an automated tool here.

I'm not looking for a point-and-click tool, more something that would highlight possible optimizations to make a table 3NF compliant. I guess it might use statistics based on good sample data and/or semantic analysis of column names.

Best Answer

Normalization absolutely is used in the real world... and hopefully you know that 3NF is only the third one of... what is it now, 8? But 3NF should be an easy target.

However... I would venture to say that there could not be such a tool.

Normalization, technically, is an attribute of each table. Within a given database, different tables may have different levels of normalization.

Each table represents facts... facts about instances of a certain type of thing (person, account, order, shipment, item, location) including, sometimes, foreign keys which lead you to other kinds of facts about that thing.

Normalization has to do with how accurately and efficiently facts are represented in the tables, as well as the ability of the table's design to prevent ambiguous and redundant data patterns.

Thus, an understanding of the actual facts is required... which is outside the scope of automated tools.

Q: Is a table with { student, subject, instructor } in 3NF?
A: What are students, subjects and instructors?

In a world where all instructors taught all subjects and each student could take any combination but not more than one course on each subject from each instructor, this table could indeed be said to be in 3NF. In the real world, making the claim of 3NF for this table is absurd.

To understand that it isn't in 3NF requires an understanding of the nature of the facts it represents. In our reality, this table is not going to be 3NF since (among other reasons) the subject and the instructor are associated together in ways that have nothing to do with the student. If we have the courses where instructors teach subjects stored elsewhere in our database, why would we copy both values here instead of a foreign key from the other table indicating that the student was signed up for the course? If the instructor is replaced, we have to change multiple records in multiple places.
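To make the limitation concrete, here is a minimal Python sketch of the kind of check a sample-data tool could run. The helper `fd_violations` and the enrollment rows are hypothetical, invented purely for illustration. The point is that data can refute a functional dependency but never confirm one: a dependency that survives the sample may still be false in the real world.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return each lhs value that maps to more than one rhs value in rows."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in lhs)
        seen[key].add(tuple(row[c] for c in rhs))
    return {k: v for k, v in seen.items() if len(v) > 1}

# Hypothetical sample rows; the names are made up for this example.
enrollment = [
    {"student": "Ann", "subject": "Math", "instructor": "Lee"},
    {"student": "Bob", "subject": "Math", "instructor": "Lee"},
    {"student": "Ann", "subject": "Physics", "instructor": "Kim"},
]

# No counterexample in this sample, so subject -> instructor merely looks valid.
print(fd_violations(enrollment, ["subject"], ["instructor"]))  # {}

# A single new row refutes it: Math is now taught by two instructors.
enrollment.append({"student": "Cy", "subject": "Math", "instructor": "Kim"})
print(fd_violations(enrollment, ["subject"], ["instructor"]))
```

Whether the first result reflects a real business rule or just a small sample is exactly the question that only someone who understands the facts can answer.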

The more normalized a database is, the more intrinsically consistent it is with the real world and with itself, and the more difficult it is for the database's facts to be inadvertently untrue. Database design is an art, but it is most definitely a science as well.

Even though I do not see eye-to-eye with everything he writes, I would recommend Chris Date's book, Database Design and Relational Theory: Normal Forms and All That Jazz, which goes into excruciating detail about the underlying theory of the relational model.

Related Solutions

Can a 2NF relation be M:M

The normal forms don't really have anything to do with many-to-many relationships. If you happen to lose some as a byproduct of the normalization process, that's fine, but you won't generally do so. Suppose we have two tables, Salesman and Product, each with an ID field as its primary key, and a third table called Specializes which shows which salesmen specialize in selling which products. This Specializes table represents the many-to-many relationship, since each salesman can specialize in multiple products and each product can be specialized in by more than one salesman. It would probably look something like this (excuse the awkward formatting, we can't do real tables on StackExchange):

| SalesmanID | ProductID |
|------------|-----------|
|  1         |  1        |
|  1         |  2        |
|  2         |  1        |
|  2         |  2        |
|  2         |  3        |

Obviously, the lack of nulls and repeated rows means that this table is in 1NF. In this table, the only candidate key is {SalesmanID, ProductID} and as such, there are no non-prime attributes. It also contains no non-trivial functional dependencies. Thus, it is necessarily in 2NF, 3NF, and BCNF. I'm also going to assert without proof that it's in 4NF, 5NF, 6NF, and DKNF (to avoid having to explain all of the details thereof). So really, no normal form removes many-to-many relationships, nor are they meant to. The purpose of normal forms is not to remove many-to-many relationships (and I am not actually clear on why you would want to) but rather to remove potential insert anomalies, update anomalies, and deletion anomalies. The primary role of the normal forms is to ensure that each piece of information is represented in a database table precisely once. Having the same information embedded in multiple places leads to problems. But that has nothing to do with many-to-many relationships.

I think that you mean to be asking a slightly different question, about situations where a many-to-many relationship is embedded in a table which also tries to contain other information, such as if the table above contained the product name in addition to the product number (where name is functionally determined by number). A table like that would either violate 2NF (if the name did not also functionally determine the number) or Boyce-Codd Normal Form (if the name did functionally determine the number).

You could also perhaps be thinking of a different situation: when we have two unrelated 1:M relationships in the same table, such as if we were to add a third column to identify which language or languages each salesman speaks.

| SalesmanID | ProductID | Language |
|------------|-----------|----------|
|  1         |  1        | English  |
|  1         |  2        | English  |
|  2         |  1        | Spanish  |
|  2         |  2        | Spanish  |
|  2         |  3        | Spanish  |
|  2         |  1        | French   |
|  2         |  2        | French   |
|  2         |  3        | French   |

As you can see, that table is quite problematic, since we need 6 entries to express that Salesman 2 specializes in 3 products and speaks 2 languages. This is a fourth normal form violation.
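The usual fix is to split the two independent facts into separate tables. As a rough sketch (plain Python sets standing in for tables, with the variable names being my own), projecting the rows above onto (SalesmanID, ProductID) and (SalesmanID, Language) and joining them back on SalesmanID reproduces the original table exactly, which is what makes the split safe:

```python
# The eight rows of the three-column table above: (SalesmanID, ProductID, Language).
rows = {
    (1, 1, "English"), (1, 2, "English"),
    (2, 1, "Spanish"), (2, 2, "Spanish"), (2, 3, "Spanish"),
    (2, 1, "French"), (2, 2, "French"), (2, 3, "French"),
}

# Project onto the two independent facts.
specializes = {(s, p) for s, p, _ in rows}
speaks = {(s, lang) for s, _, lang in rows}

# Natural join on SalesmanID.
rejoined = {
    (s, p, lang)
    for s, p in specializes
    for s2, lang in speaks
    if s == s2
}

print(len(rows), len(specializes), len(speaks))  # 8 5 3
print(rejoined == rows)  # True
```

Salesman 2 now needs 3 + 2 rows instead of 3 x 2, and adding a language no longer means adding one row per product.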

Edit:

Upon clarification, it's clear that what he's asking about is a table like the Specializes table, but with extra information about the salesmen and the products, essentially, a table which contains two entity sets and their many-to-many relationship in a single table. So to answer that question directly, yes, you can have lousy tables like that which are in 3NF. The normal form which guarantees that that won't happen is Boyce-Codd Normal Form (BCNF). Here's an example of a lousy table like that which is vulnerable to all kinds of anomalies (insert, update, and delete), but is in 2NF and 3NF.

| SalesmanName | SalesmanID | ProductID | ProductName |
|--------------|------------|-----------|-------------|
|   Alex       |  1         |  1        |  Thingy     |
|   Alex       |  1         |  2        |  Whatsit    |
|   Barb       |  2         |  1        |  Thingy     |
|   Barb       |  2         |  2        |  Whatsit    |
|   Barb       |  2         |  3        |  Whoosit    |

So, looking at this table, it's obviously in 1NF. Further, we can identify the non-trivial functional dependencies very straightforwardly: SalesmanName -> SalesmanID, SalesmanID -> SalesmanName, ProductID -> ProductName, and ProductName -> ProductID. Next we need to identify the candidate keys. There are four: {SalesmanName, ProductID}, {SalesmanName, ProductName}, {SalesmanID, ProductID}, and {SalesmanID, ProductName}. As such, we have no non-prime attributes. Thus, we are necessarily in 2NF (no non-prime attribute depends on a proper subset of a candidate key) and 3NF (no non-trivial functional dependencies where the left-hand side is not a superkey and the right-hand side contains a non-prime attribute). However, we are not in BCNF because there do exist non-trivial functional dependencies whose left-hand side is not a superkey.
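Once the functional dependencies have been written down by a human, the rest of this reasoning is mechanical. The following is a small Python sketch of it, a brute-force approach that is only sensible for a handful of attributes; the function names are my own, not from any library. It computes attribute closures, enumerates the candidate keys, and then tests each dependency against the 3NF and BCNF conditions.

```python
from itertools import combinations

ATTRS = frozenset({"SalesmanName", "SalesmanID", "ProductID", "ProductName"})
# The four dependencies stated above, as (left-hand side, right-hand side).
FDS = [
    ({"SalesmanName"}, {"SalesmanID"}),
    ({"SalesmanID"}, {"SalesmanName"}),
    ({"ProductID"}, {"ProductName"}),
    ({"ProductName"}, {"ProductID"}),
]

def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(attrs, fds):
    """Minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            s = frozenset(combo)
            if any(k <= s for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(s, fds) == set(attrs):
                keys.append(s)
    return keys

def is_superkey(lhs):
    return closure(lhs, FDS) == set(ATTRS)

keys = candidate_keys(ATTRS, FDS)
prime = set().union(*keys)  # attributes that appear in some candidate key

# BCNF: every non-trivial dependency must have a superkey on the left.
bcnf_violations = [(l, r) for l, r in FDS if not is_superkey(l)]
# 3NF also accepts a dependency whose right-hand attributes are all prime.
nf3_violations = [
    (l, r) for l, r in FDS if not is_superkey(l) and not (r - l) <= prime
]

print(len(keys))             # 4
print(len(bcnf_violations))  # 4
print(nf3_violations)        # []
```

Every dependency fails the BCNF test, yet none fails the 3NF test, because all four attributes are prime.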

Any similar situation will also fail Boyce-Codd Normal Form, because there will be some non-trivial functional dependency whose left-hand side is not a superkey. Any table like this will essentially have two entity sets, each of which has some attributes. Basically, it will have a left entity set and a right entity set. The left entity set will have some attributes which uniquely identify each left entity and the right entity set will have some attributes which uniquely identify each right entity. Those will be involved in functional dependencies. However, neither set of identifying attributes will be a candidate key on its own, because you have to combine them to get a candidate key for the whole table. As such, they won't be superkeys and there will be a Boyce-Codd Normal Form violation. So BCNF will stop it cold. Anything less than that will only catch some cases. Really, if you only remember one normal form, it should be BCNF.
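For illustration, here is a minimal sketch of the decomposition BCNF pushes you towards, using Python dicts and a set as stand-ins for three tables (the variable names and the replacement name "Gizmo" are made up for this example). Each name is stored exactly once, and joining the three pieces rebuilds the combined table.

```python
# The combined table from above: (SalesmanName, SalesmanID, ProductID, ProductName).
lousy = [
    ("Alex", 1, 1, "Thingy"),
    ("Alex", 1, 2, "Whatsit"),
    ("Barb", 2, 1, "Thingy"),
    ("Barb", 2, 2, "Whatsit"),
    ("Barb", 2, 3, "Whoosit"),
]

# One table per entity set, plus one for the many-to-many relationship.
salesman = {sid: name for name, sid, _, _ in lousy}
product = {pid: pname for _, _, pid, pname in lousy}
specializes = {(sid, pid) for _, sid, pid, _ in lousy}

def rebuild():
    """Join the three pieces back into the combined shape."""
    return sorted((salesman[s], s, p, product[p]) for s, p in specializes)

print(rebuild() == sorted(lousy))  # True

# Renaming a product is now a single change instead of one per salesman.
product[1] = "Gizmo"
print(sum(1 for row in rebuild() if row[3] == "Gizmo"))  # 2
```

In the combined table the same rename would have required updating every row that mentions product 1, which is the update anomaly in miniature.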

Is this relation in third normal form (3NF)

In the end whether the table is 1NF and 3NF depends on what the domain value of a city is. This answers whether or not you have atomicity.

What normalization means for address information is, in the abstract, quite controversial, and there is no easy answer. I personally believe it must be addressed case by case, based on the semantic use of the data.

Otherwise consider the following address:

418 N Bradley St
Chelan, WA 98816

How many atomic fields would this require with maximum decomposition? So we consider this to be:

CREATE TABLE address (
    id int not null unique,
    street_segment char(1),      -- directional prefix, e.g. 'N'
    street_name varchar not null,
    city_name varchar,
    state_province varchar,
    mail_code varchar            -- type was missing; varchar is an assumption
);

That seems straightforward enough until you get an address that requires a cross-street, as is common in South America; then we start to add fields. But now you have to track an address in Managua like:

Bo Altagracia km 3 1/2 carretera sur | Montoya 3c al Oe

Note that this is an ordinal address based on landmarks, distance and direction, so it does not fit into this data model at all. A normalized representation would require at least one additional table and some additional fields, and we could no longer enforce the same not null constraints.

But then this doesn't necessarily mean there is anything breaking 1NF if we just store the address as a text string up through and including the country.

If you have no reason to track countries, then one may see "London, England" as a valid representation of a city designation and then yes, we could see this as 1NF. In essence such a table has a city domain (where "London, England" is distinct from "London, UK" and "Paris, France" is distinct from "Paris, North Carolina"). This gives you a number of possible issues, but it does not pose classic normalization problems unless you need to track countries (here the country is just an incidental part of the city's domain value, not a domain in itself).

So in this case it may or may not be 1NF, depending on how you define the domain of the address and what you need to track it relative to. If it is 1NF then it is also 3NF.

However if you break city and country into separate fields, then you have a normalization problem because city is dependent on country, and so to be truly 1NF you'd probably need to break out your regional hierarchies quite a bit more (a cities table, a state/province table, a country table).

A typical approach to this is exactly what you have done, which is to treat the address as a text string atomic domain, and not break things out any further.

Edit: Another example of 1NF problems using arrays

Suppose on PostgreSQL I am storing IP addresses and will be querying on octets. I might represent an IP address as a smallint[] array like: array[192,168,1,101,24] instead of a cidr representation like '192.168.1.101/24' and this does not break 1NF. Each array of smallints is distinct in its domain and each represents a single value of its domain. This does not break 1NF because in the domain of an IP address, each array represents a single value in its domain (and this is ensured by the fact that ordinality is important). This is a good example of why it is wrong to assume, for example, that the inclusion of complex data structures or arrays necessarily violates 1NF.

Finally if "this datatype can be decomposed and therefore it isn't atomic" breaks 1NF then so does every use of a datetime datatype....

TL;DR

1NF's atomicity requirement is violated only when the column stores two or more values within a given domain. That is not the case in the initial question, so 1NF is not violated given the information shown. Given that 1NF is not violated and both name and address are functions of customerid, and not functions of each other, the requirements of 3NF are met.
