SQL Server – How to "normalize" external data before insertion into a "data warehouse"-like database

etl, normalization, sql-server

I want to be database-agnostic, but let's say we're inserting data into SQL Server, which is to be used as a data warehouse.

There are MILLIONS of email records from an email platform that wasn't too well thought out technically. Its data is not normalized.

So, for instance, one import might be 7 million records where one important field is the subject line (for testing purposes), and that subject line is "Happy Mother's Day One and All, we love Long Texts!" for all 7 million records.

Now multiply that "7 million copies of the exact same long text" pattern across maybe 50 different subject lines. The benefits of normalization couldn't be clearer in this example: convert the highly redundant long texts to integers, and store a small dimension table translating them back to text. Right?
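For illustration, a minimal sketch of what that dimension-plus-fact layout could look like in SQL Server (all table and column names here are hypothetical):

    -- Small dimension table: each distinct subject line stored once
    CREATE TABLE dbo.DimSubject (
        SubjectKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        SubjectText nvarchar(400)     NOT NULL UNIQUE
    );

    -- Big fact table stores only the integer key, not the long text
    CREATE TABLE dbo.FactEmail (
        EmailId    bigint    NOT NULL,
        SentDate   datetime2 NOT NULL,
        SubjectKey int       NOT NULL
            REFERENCES dbo.DimSubject (SubjectKey)
    );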

I'm just wondering about the exact implementation of this.

I'm going to use an ETL tool (one or another), but I suppose there are multiple ways of doing it. I guess I would attempt to translate the "subject" into an int by looking it up in the small dimension table, and if no match is found, append the distinct new values to that table with an auto-incrementing key? Just wondering if this is a common task.

Normalizing smelly data before insertion into a data warehouse. I don't mean cleaning, auditing, profiling — I mean literal database normalization to shrink the data volume.
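A rough sketch of that lookup-then-append step, assuming the hypothetical tables above plus a staging table dbo.StagingEmail holding the raw import:

    -- 1. Append any subject lines not yet in the dimension table
    INSERT INTO dbo.DimSubject (SubjectText)
    SELECT DISTINCT s.SubjectText
    FROM dbo.StagingEmail AS s
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.DimSubject AS d
        WHERE d.SubjectText = s.SubjectText
    );

    -- 2. Load the fact table, translating each subject to its integer key
    INSERT INTO dbo.FactEmail (EmailId, SentDate, SubjectKey)
    SELECT s.EmailId, s.SentDate, d.SubjectKey
    FROM dbo.StagingEmail AS s
    JOIN dbo.DimSubject   AS d
        ON d.SubjectText = s.SubjectText;

Most ETL tools offer an equivalent of this as a built-in lookup step (SSIS has a Lookup transformation, for example) that caches the dimension table in memory, so the translation can happen in the data flow rather than in T-SQL.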

Best Answer

The whole process you describe is already implemented in the database engine itself as part of columnstore compression.
The database engine will scan the data for repeating text, find the duplicates, and build dictionaries. In the row data it keeps only small references into the dictionary.
Be sure to keep your data batches big enough to trigger this compression during the insert, following this guidance: Columnstore indexes - Data loading guidance
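To make that concrete, a minimal sketch (reusing hypothetical names and a made-up file path) of letting columnstore compression do the dictionary work instead of a hand-rolled dimension table:

    -- Keep the long subject text as-is; columnstore compression
    -- dictionary-encodes repeated string values internally.
    CREATE TABLE dbo.FactEmailCs (
        EmailId     bigint        NOT NULL,
        SentDate    datetime2     NOT NULL,
        SubjectText nvarchar(400) NOT NULL
    );

    CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactEmailCs
        ON dbo.FactEmailCs;

    -- Load in large batches: per the linked guidance, batches of at least
    -- 102,400 rows are compressed straight into columnstore rowgroups
    -- instead of accumulating in the delta store, and about 1,048,576 rows
    -- per batch is the documented sweet spot.
    BULK INSERT dbo.FactEmailCs
    FROM 'C:\imports\email_batch.csv'   -- hypothetical file path
    WITH (FORMAT = 'CSV',               -- FORMAT = 'CSV' needs SQL Server 2017+
          BATCHSIZE = 1048576,
          TABLOCK);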