Why is it considered that a set unnormalizes a database absolutely

normalizationrelational-theory

In order for a relationship to be on 1NF it needs to have all values as atomic, if there's a set it's not even in that first normal form:

But intuitively I think that a table with that set would be more normalized than one that is not as long as the values of that set are just used as attributes of the entity.

For example, let's imagine this table about paintings:

Painting_name, Author, Used technique, Used colors

Now if we get to use a set of colors like {blue, Green, yellow, black, White, purple} we get a table that is not even in 1NF.

If we pass the table to 1NF, then we need to have 6 rows each one repeating the Painting_name, Author, and Used technique.

This looks way more unnormalized than the table that is not even in 1NF, and I don't see why having a set there would harm any possible normalization, as that sets would only be used in that table.

What is then the reason that atomic values are needed in order to have a normalized table?

Best Answer

The reference for this post is an amazing book called Database System Concepts 6th Edition which I recommend you read.

In the book, page 328, it states:

A domain is atomic if elements of the domain are considered to be indivisible units. We say that a relation schema R is in first normal form (1NF) if the domains of all attributes of R are atomic.

You might be wondering "but why!?", it's best explained using a practical example.

Let's look at your example with colors. Lets say we have 2 scenarios, 1.) where the table is not in 1NF, 2.) where the table is in 1NF.

1.)

Id          Painting_name                  Author                         Used_colors
----------- ------------------------------ ------------------------------ ---------------
1           Some_Painting                  John                           Blue, red, Yellow
2           Monalisa                       Leonardo da Vinci              orange, black, White, red, Yellow

Although this might seem intuitive to you, consider what happens when you want to query this table. One, you have inconsistent upper and lower case (which you would have to check with your query for both cases) two, if used_colors isn't an array, you would either have to convert it to an array or use extra steps just to check the data you need (like using a string_split function in SQL Server 2014 and later).

This causes performance issues and ends up being a hassle every time you want to check for something. If you are wondering what stops a person from typing in black_white_yellow instead? That question is answered in 2NF and 3NF, foreign key constraints etc...

2.)

Id          Painting_name                  Author                         Used_colors
----------- ------------------------------ ------------------------------ ---------------
3           Some_Painting                  John                           Blue
4           Some_Painting                  John                           red
5           Some_Painting                  John                           Yellow
6           Monalisa                       Leonardo da Vinci              orange
7           Monalisa                       Leonardo da Vinci              black
8           Monalisa                       Leonardo da Vinci              White
9           Monalisa                       Leonardo da Vinci              red
10          Monalisa                       Leonardo da Vinci              Yellow

In this case, every single row is atomic and unique. What did we gain by doing this? You don't have to think about handling arrays, since every single row now is atomic you can clearly check for what you need quickly and efficiently.

And just to give you an idea of how big of a hassle it could become, here are some posts related to the issue of searching for values in a comma separated column:

There are also many different approaches, none of which are really elegant to say the least (at least pertaining to the problem of 1NF). So, basically by using the 1NF you go from using code like the ones mentioned in those posts above to simple stuff like:

SELECT * FROM Paintings WHERE Used_colors LIKE 'BLUE'

Which helps in both readability and performance.

There is one thing you need to keep in mind, 1NF is only the starting point of the normalization process. The 1NF alone is practically never used in any DB out there, after the 1NF comes the 2NF where you would have to split this table into two separate tables. Used_colors will be made into its own table called Colors. At this point you will reach the cardinality issue which is also covered in the book mentioned above.

One last thing, there are many cases where you're going to come across databases that break the 1NF on one or more tables while adhering to the rules of 2NF, 3NF and even 4NF. For example PostgreSQL has json data type which immediately breaks the 1NF rule (as you can save multiple keys and values in json). The general rule of thumb goes like this: Always follow the normal forms unless you really know what you're doing. Since the moment you introduce such variables you could cause inconsistencies throughout the database and you will most likely lose on performance.

Also, as Paul said in the comment below, there is another point of view which Christopher. J. Date supports(who is a famous researcher and writer on relational DB theory as well as being one of the people with Ted Codd who helped push the relational model). This outlook stabs at the idea of atomic values in 1NF, saying that the whole term "atomic" is ambiguous. The idea behind this is simple, you can say that almost no data type is atomic under the current definition of 1NF. To explain this, lets take a look at an example:

Lets say you have a string Hello World. You have functions in every DBMS that decomposes this string into smaller chunks (like the SUBSTRING or LEFT / RIGHT functions) which means that string isn't really an atomic value as everything can be decomposed. This outlook is similar to data types such as json, you can decompose both string and json, so why is one considered atomic while the other not? That's why C. J. Date considers the current definition of 1NF as ambiguous, since almost all data types can be decomposed in one way or another. If you access json or xml data types as a whole unit without decomposing them, there is nothing to say that json or xml data type isn't atomic.

You can find an interesting paper on The Third Manifesto, they (C. J. Date and Hugh Darwin) made their paper on Databases, Types and the Relational Model The Third Manifesto publicly available. It's also an interesting read which will send you to interesting other articles and topics in general.