How to normalize data based only on the repetitions

normalization

Given a non-normalized data table, is there a good a strategy to attempt to normalize based on repeated data? Assume we have no information about the significance of the columns.

In a small example

Col1            Col2         Col3
========================================
Europe          France       Paris
Europe          France       Marseilles
Europe          England      London
Europe          England      Manchester
N.America       USA          St. Louis
N.America       USA          Washington D.C.

I can see that there are three source tables, probably called, Continents, Countries, and Cities. What about for much larger data sets. Say 45 columns and several hundred thousand rows. Looking at the data is obviously not going to help much.

Best Answer

I've had to reverse engineer several existing complex data sets. The most important thing to establish are the keys and dependencies in the data. The problem is frankly NP hard, so some intuition and inspection will go a long way to getting you to a sensible answer, so don't count on a simple "turn the crank" solution unless you've got a lot of time on your hands.

What you need to do is to query the data a column at a time and by combinations of columns. You want frequency distributions for column values (and column combination values). Columns (or combos) with maximum frequencies of 1 are candidate keys.

You can also look at frequency distributions of combinations of columns to find potential hierarchies. In your example each value in Col2 only ever has one value in Col1 and so forth.

When you identify candidate keys and dependencies between columns you can apply normalization.


EDIT: In response to OP's quesiton in comments:

This is a query that would tell you whether or not one column may have a functional dependency on another column:

select COL2 
from MYSTERY
group by COL2
having count(distinct(COL1)) <> 1

If COL2 has a functional dependency on COL1 then this query would return 0 rows. This is because every value of COL2 has exactly 1 corresponding value of COL1.