First Normal Form: Definitive Definition

database-designnormalizationrelational-theory

I am trying to get a definitive version of what is First Normal Form. Everything I read has a slightly different spin.

Many authorities, such as Date, say that by definition a relation is always in First Normal Form, while others give a list of requirements. This means that there are from zero to many requirements for 1NF.

I suppose the difference is that between tables and relations: tables can be a complete mess, while a relationship follows certain restrictions. The fact that a relation is represented as a table in SQL thus creates some confusion.

I am focusing specifically on 1NF as it relates to SQL databases. The question is: what properties are required to ensure that a table is in the first normal form?


Many authorities suggest that if a table represents a relation, then it is already in 1NF. This pushes the definition of 1NF back to the definition of a relation.

Here are some properties of a table in 1NF:

  • Column Order is insignificant [1]
  • Rows Order is insignificant
  • All rows are the same length (ie, row data matches column headers)
  • There are no duplicate rows (this can be guaranteed using a surrogate primary key, but the PK is itself not required)
  • There are no repeating columns
  • Each column contains a single value (atomic)

[1] Technically attributes are unordered, but in a table, the row data must be in the same order as the column headers. However, the actual order is insignificant.

On multiple data:

The concept of atomic data is that an item cannot be further broken down. This concept has been qualified in that although technically everything can be broken down ad nauseum, the data in question cannot be practically broken down any further, depending on how the data will be used.

For example, a full address or a full name should normally be broken down further, but the components such as the given name or town name should probably not be broken down any further, despite the fact that as strings they can be.

As regards repeating columns, it is a poor design column to have nearly repeating columns, such as phone1, phone2 etc. In general, repeated data indicates the need for an additional related table.

Dependence

There should not be any relationship between rows, other than that they conform to the same headers.

There should also be no relationship between columns, but I believe that is the subject of higher normal forms.

The question is: How much of the above is in the definition of 1NF? Does the independent rows bit also come into it?

Best Answer

Preliminary

The definition of normal form (which from the presentation of “Further Normalization of the Data Base Relational Model” in 1971 is known as first normal form) and the definition of the relational paradigm itself was published in 1970 in the scientific paper that provided a strong foundation for the practice of database administration, i.e., “A Relational Model of Data for Large Shared Data Banks” (RM for brevity) created by Dr. E. F. Codd, who is a Turing Award recipient and the authority with regard to the relational framework.

Yes, there is plenty of explanations, interpretations, expositions, deviations and opinions about Dr. Codd’s text, but I personally prefer to stick to the original source and I highly suggest that you analyze it by yourself so that you can draw your own conclusions.

I certainly do not understand the RM in its entirety, but what I do understand of it permits me appreciating its excellence, vision, intention and scope, and although decades later one can note that it has a few slight imprecisions, they do not reduce, in any way, its genius and elegance. In its field, the RM has stood the test of time in a unique way, and remains unmatched.

The act of emphasizing the aforementioned imprecisions would be —using a charitable term— unfair because, seeing it from from a considerable distance, this seminal material required a few refinements and extensions, yes, but the main body of the work was rock solid from the very conception (and, indeed, Dr. Codd made most —if not all— of such refinements and extensions himself).

I continue rereading the RM constantly in order to strengthen my comprehension of this exceptional source of knowledge (and my esteem of it keeps growing on every reread); the objective is to stand on the shoulders of giants.

Relations and tables

It is important to note that as relations are abstract resources, Dr. Codd envisioned the utility of representing them in tabular form (he initially used the term “array representation” but subsequently utilized “table” or “r-table”), so that the users, designers and administrators of a relational database (RDB) can approach them in a more familiar or concrete manner. Therefore, within the context of a RDB implementation, it is valid to use table as a shorthand for relation, as long as said table stands for an actual relation. This feature is —although obvious— quite significant because before evaluating whether or not a table represents a relation that complies with first normal form (1NF), it has to represent, precisely, a relation.

The RM naturally contains the qualities that a table must have to determine if it in fact portrays a relation, but I will offer an informal and unpretentious interpretation about them here (another one, yes!):

  • It must have a name (each particular relation in a database structure must be distinguished from the rest).
  • Each of its rows must depict exactly one tuple of the pertinent relation.
  • The order of its rows is not important at all.
  • Each of its columns must have a name that stands for the meaning of exactly one domain of the concerning relation, and said name must be different from the names of the rest of the columns of the table (a column must be uniquely differentiated and must carry a distinctive meaning and, yes, the role played by a database modeler and the business experts to define each domain of significance with accuracy is paramount)
  • The order of its columns has no significance.
  • All of its rows must have the same number of columns.
  • It must have at least one column, or one combination of columns, that uniquely identifies every one of the tuples depicted via rows; in this way, all of the rows must be different (yes, this stresses the importance of having at least one KEY declared, and when there are two or more KEYs one should be defined as PRIMARY based on pragmatic reasons, while the rest can be deemed as ALTERNATE; but yes, before making the decision, each of the KEYs was a “candidate” for a definition as PRIMARY).

Having a table that in actual fact represents a relation is critical as, when it undergoes manipulation operations of a relational kind, then the result is, again, a table that represents a relation. In this manner, the behavior of said table is predictable.

Atomic domains (columns)

In the first sections of the RM, Dr. Codd presents several samples of relations in order to introduce some concepts; so, in order to comprehend the meaning of atomic domain, let us start with the following excerpt from the RM that details some pertinent points:

So far, we have discussed examples of relations which are defined on simple domains—domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on.

In this way, one can say that each of the aforementioned expository relations fits in one of two kinds, say either kind A or kind B:

  • Kind A groups only relations (tables) that are structured with domains (columns) that contain exclusively simple values in every one of their tuples (rows), i.e., such domains (columns) do not contain relations (tables) as values, which in this context means that the values are atomic because they cannot be decomposed successively into new relations (tables). Hence, the relations of this class are the ones that are normalized, i.e., they comply with 1NF, their form is desirable.

  • Kind B is exclusively integrated by relations (tables) that have one or more domains (columns) that hold relations as values in each respective tuple (row), and that signifies that said values are nonatomic since they can be subsequently broken down into new relations (tables), i.e., they are decomposable. Thus, the relations of this sort are unnormalized, i.e., they infringe 1NF, they are in an undesirable form.

Normalization

Dr. Codd introduces the section about normalization in the RM with the following paragraph:

A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating! There is, in fact, a very simple elimination procedure, which we shall call normalization.

Then he goes on to show:

  1. A group of relations where one is unnormalized (it has domains that contain relations as values, i.e., they are nonatomic; i.e., they are non-simple)

  2. A group of relations that are a normalized (i.e., one that was decomposed; i.e., one which relation valued domains were broken down into simple ones which signifies that they are atomic)

And then he describes the procedure for obtaining normalized relations from unnormalized ones.

In this respect, the relations he employed to illustrate a normalization exercise and the exercise description itself are quite clear, and I recommend again that you analyze them yourself (and I as well hope this encourages some readers to engage with the text).

Sucessively, he indicates:

Further operations of a normalizing kind are possible. These are not discussed in this paper.

And said operations, i.e., second and third normal form (2NF and 3NF) are actually detailed in “Further Normalization of the Data Base Relational Model”, and as mentioned above, after the presentation (and the later printing and publication) of this paper, the original normal form became known as first normal form.

As a practitioner can observe, having unnormalized relations (tables) introduces (almost always unnecessary) convolution into RDB implementations.

A relation that satisfies 1NF, eases the definition of constraints and data manipulation operations which can be implemented by means of a data sublanguage that is less complex than that required for unnormalized relations (tables), as Dr. Codd points out in the following lines:

The adoption of a relational model of data, as described above, permits the development of a universal data sublanguage based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relations is in normal form. Such a language would provide a yardstick of linguistic power for all other proposed data languages, and would itself be a strong candidate for embedding (with appropriate syntactic modification) in a variety of host languages (programming, command- or problem- oriented). […]

[…]

The universality of the data sublanguage lies in its descriptive ability (not its computing ability).

The bewilderment

From my point of view, the bewilderment has arisen, due to (a) the aforementioned excess of interpretations, explanations, etc., about 1NF and the RM itself, and because of (b) further attempts to redefine 1NF that state that having relations with domains that hold values that are, in turn, relations comply with 1NF as long as they are one single value for each corresponding tuple.

My take on your other pointss

There should not be any relationship between rows, other than that they conform to the same headers.

I am not sure if I understand the intention of that statement correctly but, apart from conforming to the same headers, there must be a connection between the (tuples) rows of a relation (table) as each of them should be an assertion about a particular occurrence of the specific entity type (defined in terms of the business context of interest) that the relation (table) is supposed to represent.

There should also be no relationship between columns, but I believe that is the subject of higher normal forms.

I do not know if I am properly interpreting the meaning of that statement either but, in fact, and in accordance with my response to the previous aspect, there must be a relationship between the domains (columns) of a relation (table) as well, which is precisely why it is a relation (the essential structure of the relational model and of a concrete RDB implementation).

To exemplify, with regard to the hypothetical relation (table)

  • Salary (PersonNumber, EffectiveDate, Amount)

the tuple (row)

  • Salary (x, y, z)

would convey the meaning

  • The Salary payed to the Person identified by PersonNumber x, on EffectiveDate y corresponds to the Amount of z

Therefore, each tuple (row) of the Salary relation (table) must fit into the structure of the assertion shown above, and the difference would be the replacement of the pertinent domain (column) values, but there must exist a relationship between (a) all the Salary domains (columns) and also between (b) all their corresponding values with respect to each tuple (row); such a relationship it is indispensable.

Higher normal forms (2NF and 3NF) are useful to get rid of functional dependencies between domains (columns) of a relation (table), they assist in avoiding undesirable connections between domains (columns), as said undesirable connections allow the introduction of update anomalies. Both 2NF and 3NF are helpful to test the soundness of the structure of the relations (tables) in a certain RDB implementation.