How to normalize data based only on the repetitions

normalization

Given a non-normalized data table, is there a good a strategy to attempt to normalize based on repeated data? Assume we have no information about the significance of the columns.

In a small example

Col1            Col2         Col3
========================================
Europe          France       Paris
Europe          France       Marseilles
Europe          England      London
Europe          England      Manchester
N.America       USA          St. Louis
N.America       USA          Washington D.C.

I can see that there are three source tables, probably called, Continents, Countries, and Cities. What about for much larger data sets. Say 45 columns and several hundred thousand rows. Looking at the data is obviously not going to help much.

Best Answer

I've had to reverse engineer several existing complex data sets. The most important thing to establish are the keys and dependencies in the data. The problem is frankly NP hard, so some intuition and inspection will go a long way to getting you to a sensible answer, so don't count on a simple "turn the crank" solution unless you've got a lot of time on your hands.

What you need to do is to query the data a column at a time and by combinations of columns. You want frequency distributions for column values (and column combination values). Columns (or combos) with maximum frequencies of 1 are candidate keys.

You can also look at frequency distributions of combinations of columns to find potential hierarchies. In your example each value in Col2 only ever has one value in Col1 and so forth.

When you identify candidate keys and dependencies between columns you can apply normalization.

EDIT: In response to OP's quesiton in comments:

This is a query that would tell you whether or not one column may have a functional dependency on another column:

select COL2 
from MYSTERY
group by COL2
having count(distinct(COL1)) <> 1

If COL2 has a functional dependency on COL1 then this query would return 0 rows. This is because every value of COL2 has exactly 1 corresponding value of COL1.

Second normal form

If I understand you correctly, the combination of {date, compound_type, location, method} uniquely identifies {value, units}, and all four are needed in order to identify a unique sample ({date, location, method} isn't enough by itself, for example).

I'm going to write this as if I hadn't received an answer on my question about functional dependencies, since other people might be interested in an explanation of both possibilities.

If there are no partial dependencies

1) Assuming none of the non-prime attributes {value, unit} depend on part of the candidate keys {id} or {date, compound_type, location, method}, your table is in 2NF since, as Wikipedia puts it, "every non-prime attribute of the table is either dependent on the whole of a candidate key, or on another non-prime attribute."

If there are partial dependencies

2) One or both of the non-prime attributes {value, unit} depend on only parts of the candidate key {date, compound_type, location, method}. You've confirmed this is the case with {compund}->{unit}, so your table is not in 2NF.

In order to fix the violation of 2NF, I would suggest moving {unit} to the compound table, which I'm guessing would end up looking something like this: {id, name, unit}. Here, the candidate keys are {id} and {name}. Since there are no composite candidate keys, the table is automatically 2NF. It's also 3NF since there are no transitive dependencies, I.E. there's no attribute that's dependent on unit.

Third normal form

OK, that leaves us with the samples table looking like this: {id, date, compund_type, location, method, value}. The two candidate keys are {id} and {date, compund_type, location, method}, which leaves {value} as the single non-prime attribute. Assuming that there are no more 2NF violations (you can't use a subset of {date, compund_type, location, method} to uniquely determine value), we can check the table for violations against 3NF.

3NF states that every non-prime attribute (attributes that don't belong to a candidate key) must be directly dependent on every superkey. Since we only have one non-prime attribute, {value}, it's impossible for the table to violate 3NF, since there's no non-prime attribute for {value} to be dependent on, and no non-prime attribute that can depend on {value}.

I'm going to leave discussions about BCNF out of this for simplicity.

Surrogate key vs. natural key

As for your other questions: "is using an id(PK) column like I have above the best way to go with all of the repeating dates?" I think so. Semantically, the surrogate key id isn't necessary, but it does help keep things simple. I'm not sure how MySQL works underneath the hood, but in other DBMSs composite primary keys with non-integer data types can lead to unnecessary overhead for example when indexing. Another problem with composite keys is that it gets annoying to query them.

Imagine that you need to add information about which labs each sample was sent to. A sample can be sent to several labs and each lab can receive several samples, so you create a table to connect the two tables. Would you rather write this

SELECT *
FROM samples s
JOIN labs_samples ls ON 
    s.date = sl.date,
    s.compund_type = sl.compund_type,
    s.location = sl.location,
    s.method = sl.method

or this

SELECT *
FROM samples s
JOIN labs_samples ls ON s.id = ls.id

Mysql – How to normalize data en-masse

This answer assumes that users.login is unique.

First, alter the users table to include an address_id column.
Next, if you add a addresses.login column you would then be able to do the following query after the insert into addresses:
```
UPDATE users, addresses SET users.address_id = addresses.address_id 
WHERE users.login = addresses.login
```
Finally drop the addresses.login column

An alternative would be to use an AFTER INSERT trigger on addresses to update the users table based on the unique address entries. Unfortunately, this assumes that all the fields of addresses combined represent a unique users.login, which for example would be invalid if a married couple each had logins but shared the same address.