Extract, Transform and Load (ETL) is the preparation of external data for insertion into your database or data warehouse.
Looking at the basics of ETL, noted data warehouse designer Bill Inmon writes:
Once upon a time in the not so distant past, there was no ETL (extract, transform and load) software. If you wanted to build a data warehouse, you had to write code in order to get data from one source to the appropriate target. There was lots of code – lots of repetitive code.
After you wrote your code, you had to maintain it. Every time a legacy system changed, you did manual maintenance to your code. Every time a target definition changed, you had to do manual maintenance to your code. Every time an end user wanted something new, you had to do maintenance to your code.
From here, ETL products proliferated, as Inmon describes in his brief history of ETL products. They were popular because they were software tools designed to extract data from changing systems, transform it according to specific rules, and load it into data warehouses. This meant that humans were involved only in the critical part of the loop: untangling errors in the loaded data. By automating as much of the process as possible, ETL gave companies a seamless way not only to load their current databases into a data warehouse, but also to keep loading future data sets from the same databases, so that the warehouse could continue to deliver up-to-date results.
To answer your question specifically, different databases provide different extracts. Transformations are applied to normalize the data. Normalization here is meant both in the database-specific sense, reshaping the data to match the receiving data warehouse, and in the human sense, ensuring that the same data held in different systems appears the same to the receiving system.
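As a trivial sketch of that human-sense normalization (column names and code values invented for illustration), the transform might map the differing conventions of each source onto one standard:

    -- Hypothetical example: sources encode gender three different ways;
    -- the transform maps them all onto a single convention.
    SELECT CASE
               WHEN c.Gender IN ('M', 'Male', '1')   THEN 'M'
               WHEN c.Gender IN ('F', 'Female', '2') THEN 'F'
               ELSE 'U'
           END AS Gender
    FROM staging.Customer AS c;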
Data sources can be anything you can code a transform for, since the purpose of the transform is to apply rules to the incoming data so that it fits your data model. Join different data sets during the transform only when necessary; otherwise, rely on the destination database to bring the results together.
If you have 200 identical sources then you can parameterise an SSIS package with the data source and kick off multiple threads. These can be controlled within the package by a foreach loop, or from an external source that kicks off the extractors with a parameter.
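As a sketch of what the driver list for that loop might look like (the table and column names here are illustrative assumptions, not a prescribed design):

    -- Hypothetical driver table: the foreach loop (or an external
    -- scheduler) enumerates one row per source system and passes the
    -- connection string into the package as a parameter.
    CREATE TABLE dbo.SourceSystem (
        SourceId         int            NOT NULL PRIMARY KEY,
        SourceName       sysname        NOT NULL,
        ConnectionString nvarchar(4000) NOT NULL,
        Status           varchar(20)    NOT NULL DEFAULT ('Pending')
                         -- Pending / InProgress / Done / Failed
    );

The Status column also sets up the 'table as queue' approach described under BCP below.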
You could consider a full load for relatively small dimensional sources and an incremental load for transactional data. This requires persistent dimensions, but those are fairly straightforward to maintain with MERGE operations, or with a pre-load area and a dimension handler if you need slowly-changing dimensions.
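A minimal sketch of the MERGE approach for a persistent dimension (type 1, i.e. updates overwrite in place; all table and column names are invented for illustration):

    -- Upsert staged customer rows into a persistent dimension.
    -- Rows are never deleted, so the dimension survives across loads.
    MERGE dw.DimCustomer AS tgt
    USING staging.Customer AS src
        ON  tgt.SourceId         = src.SourceId
        AND tgt.SourceCustomerId = src.CustomerId
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.Region       = src.Region
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (SourceId, SourceCustomerId, CustomerName, Region)
        VALUES (src.SourceId, src.CustomerId, src.CustomerName, src.Region);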
You may wish to consider giving each source its own staging area (maybe a schema for each source in the staging database). This eliminates locking issues on the staging tables.
Build a set of views over the staging tables (essentially just a set of unions corresponding to each of the source tables) that includes data source information. These can be generated fairly easily, so you don't have to manually cut and paste 200 different queries into the union. Once you've staged the data, the ETL process can read the whole lot from the view.
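For example (schema, table and column names are hypothetical), with one staging schema per source as suggested above, the generated view might look like this:

    -- One staging schema per source avoids lock contention between
    -- concurrent extracts.
    CREATE SCHEMA Source001;
    GO
    CREATE SCHEMA Source002;
    GO
    -- ...one schema per source, generated rather than typed by hand.

    -- Generated union view: the ETL reads all sources through one
    -- object, with a SourceId column recording where each row came from.
    CREATE VIEW staging.AllOrders AS
    SELECT 1 AS SourceId, o.* FROM Source001.Orders AS o
    UNION ALL
    SELECT 2 AS SourceId, o.* FROM Source002.Orders AS o;
    -- ...one UNION ALL branch per source, generated from sys.schemas/sys.tables.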
This allows the ETL to run in one hit, although you will have to come up with a strategy to deal with extract failures from individual systems. For this, you might want to look into an architecture that deals with late arriving data gracefully, so you can catch up individual feeds that had transient issues.
BCP
For 200 simple extracts, BCP is probably a good way to go. The sources are all identical, so the BCP files will be the same across sources. You can build a load controller with SSIS. Getting multiple threads to read the top off a common list would require you to implement synchronised access to the list. The SSIS process has a bunch of loops running in parallel in a sequence container that pop the next item, execute it and update the corresponding status.
The 'next' function can be implemented as a sproc running in a serializable transaction that pops the next eligible source off the list and marks it as 'in progress' within the transaction. This is a 'table as queue' problem, but you don't have to implement synchronised inserts: a whole batch can be pushed into the table at the start of the run.
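A minimal sketch of that claim-the-next-row sproc, reading from the driver table sketched earlier (names invented; the UPDLOCK/READPAST hints are one common way to make the pop atomic, so parallel loops skip a row another thread has already claimed instead of blocking on it):

    CREATE PROCEDURE dbo.ClaimNextSource
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Claim the next pending source atomically. UPDLOCK serialises
        -- competing claimants; READPAST makes other loops skip the
        -- locked row and take the next one instead of waiting.
        WITH nxt AS (
            SELECT TOP (1) *
            FROM dbo.SourceSystem WITH (UPDLOCK, READPAST, ROWLOCK)
            WHERE Status = 'Pending'
            ORDER BY SourceId
        )
        UPDATE nxt
        SET Status = 'InProgress'
        OUTPUT inserted.SourceId, inserted.ConnectionString;  -- returned to the caller
    END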
Structure the individual extract process so that it retries once or twice if the first attempt fails. This will mitigate a lot of failures caused by transient errors. Fail the task if it still fails after the retries, and structure the ETL so that it is resilient to individual extract failures.
Incremental loads
An incremental loader is probably not worth bothering with for dimension tables unless you have a really big dimension that shows real performance issues. For the fact table data sources it probably is worth it. If you can add a row version to the application table (a rowversion/timestamp column or some such), you can pick up what's new, although you will need to track the last version you extracted locally. If there is an insert or update date on the data you may be able to use that instead.
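A sketch of one way to track that high-water mark (all names invented; RowVer is assumed to be a rowversion column on the source table, and @SourceId would be supplied by the extract package in practice):

    -- Control table holding the last row version successfully
    -- extracted from each source.
    CREATE TABLE dbo.ExtractWatermark (
        SourceId       int       NOT NULL PRIMARY KEY,
        LastRowVersion binary(8) NOT NULL DEFAULT (0x0)
    );

    DECLARE @SourceId int = 1;   -- supplied by the extract package
    DECLARE @since binary(8);

    SELECT @since = LastRowVersion
    FROM dbo.ExtractWatermark
    WHERE SourceId = @SourceId;

    -- Pick up only rows changed since the last successful extract.
    SELECT t.*
    FROM SourceDb.dbo.Transactions AS t
    WHERE t.RowVer > @since;

    -- After a successful load, advance the watermark.
    UPDATE dbo.ExtractWatermark
    SET    LastRowVersion = (SELECT MAX(RowVer) FROM SourceDb.dbo.Transactions)
    WHERE  SourceId = @SourceId;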
Full Loads
What could possibly go wrong?
200 processes kicking off to do a full load places a load spike on the network and possibly on the staging database. This can lead to all sorts of transient issues such as timeouts. For small dimension tables it's probably not such a big issue, but for 100 GB of data there is quite a variety of potential problems: WAN saturation, locking (although the right staging architecture will mitigate that) and availability of sources. The longer the extract process has to run, the bigger the influence environmental factors have on its reliability.
There are quite a lot of imponderables here, so YMMV. I'd suggest an incremental load for the larger tables if possible.
Best Answer
No.
What you describe are all variants of ELT.
The difference between ETL and ELT is in where you do the "T". The "traditional" ETL flow would implement the "T" (data transformation) outside the DBMS, using a specialized tool like DataStage, Informatica, Talend, etc. The data transformed to the target model would then be simply loaded into the destination DBMS tables.
The ELT flow differs in that it is the destination database engine that performs the "T": you load raw data into staging tables in the target database, then use the DBMS's own tools to "massage" it into the final form and copy it to the destination tables. It is an appropriate alternative when the impedance between the source and target data types is relatively low and the DBMS in question has the tools and capacity to do the job.
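As a small illustration of the "T" happening inside the destination engine (all names here are invented):

    -- ELT: the raw rows were bulk-loaded into staging.Sales untouched;
    -- the destination DBMS now performs the transformation on the way
    -- into the fact table.
    INSERT INTO dw.FactSales (DateKey, CustomerKey, SaleAmount)
    SELECT CONVERT(int, CONVERT(char(8), s.SaleDate, 112)),  -- date -> yyyymmdd key
           c.CustomerKey,                                    -- surrogate key lookup
           s.Amount
    FROM staging.Sales AS s
    JOIN dw.DimCustomer AS c
      ON c.SourceCustomerId = s.CustomerId;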