How to use an ETL

etl

I know that those letters mean Extract, Transform, and Load.

But, when I used it at first, I thought that during the Transform phase I could do plenty of different joins on data that I've extracted from data sources, later on I realized that doing a join on a different ETL is not that handy.

  • So what do we do in Transform phase ?
  • Calculate and output the result ?
  • String transformation?
  • Should input data sources only be csv, xml or plain file?
  • If joins are not that handy, should we only do high level transformation within an ETL ?

Thank you

Best Answer

Extract Transform and Load is the preparation of foreign data to be inserted into your database or data warehouse

Looking at the basics of the ETL, noted Data Warehouse Designer Bill Inmon notes:

Once upon a time in the not so distant past, there was no ETL (extract, transform and load) software. If you wanted to build a data warehouse, you had to write code in order to get data from one source to the appropriate target. There was lots of code – lots of repetitive code.

After you wrote your code, you had to maintain it. Every time a legacy system changed, you did manual maintenance to your code. Every time a target definition changed, you had to do manual maintenance to your code. Every time an end user wanted something new, you had to do maintenance to your code.

From here, a plethora of ETL products proliferated, as Inmon describes in his brief history of ETL products. They were popular because they were software tools designed to extract data from changing systems, transform it according to specific rules, and load it into data warehouses. This software process meant that humans were involved only in the critical element of the loop: untangling loaded data errors. By automating to the maximum possible extent, the ETL process provided companies a seamless way of not only loading their current databases into a data warehouse, but the ability to load future data sets of the same databases in, so that the data warehouse can continuously provide future results.

To answer your question specifically, different databases provide different extracts. Transformations are applied to normalize the data. Normalization is both in the database-specific sense, changing the patterns of the data to match the receiving data warehouse, but also in the human sense, insuring that the same data in different systems appears the same to the incoming system.

Data sources can be anything you can code a transform for, as the purpose of the transform is to apply rules to the incoming data such that it fits your data model. Joining different data sets should only be performed if necessary. Rely on your recipient database instead to synchronize results.

Related Question