If you have 200 identical sources, you can parameterise an SSIS package with the data source and kick off multiple threads. These can be controlled within the package by a foreach loop, or from an external source that kicks off the extractors with a parameter.
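As a hedged sketch, the list of sources could live in a control table that the foreach loop (or an external scheduler) iterates over; the table, package, and variable names below are illustrative assumptions, not anything from your setup.

```sql
-- Hypothetical control table of the 200 source connections.
CREATE TABLE dbo.SourceSystem (
    SourceId         int           NOT NULL PRIMARY KEY,
    ConnectionString nvarchar(512) NOT NULL
);

-- An external controller could then start one extract per source, e.g.:
--   dtexec /F Extract.dtsx /SET "\Package.Variables[User::SourceId].Properties[Value]";42
```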
You could consider a full load for relatively small dimensional sources and an incremental load for transactional data. This would require you to have persistent dimensions, but this is fairly straightforward to do with MERGE operations, or a pre-load area and dimension handler if you need slowly-changing dimensions.
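For the persistent dimensions, a minimal MERGE sketch might look like the following; `DimCustomer`, `stg.Customer`, and the column names are assumptions.

```sql
MERGE dbo.DimCustomer AS tgt
USING stg.Customer    AS src
    ON  tgt.DataSourceId = src.DataSourceId
    AND tgt.CustomerCode = src.CustomerCode
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET CustomerName = src.CustomerName            -- Type 1 overwrite
WHEN NOT MATCHED BY TARGET THEN
    INSERT (DataSourceId, CustomerCode, CustomerName)
    VALUES (src.DataSourceId, src.CustomerCode, src.CustomerName);
-- No WHEN NOT MATCHED BY SOURCE clause: rows missing from a given extract are
-- kept, which is what makes the dimension persistent.
```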
You may wish to consider giving each source its own staging area (maybe a schema for each source in the staging database). This eliminates locking issues on the staging tables.
Build a set of views over the staging tables (essentially just a set of unions, one branch per source's copy of each table) that includes the data source information. These can be generated fairly easily, so you don't have to manually cut and paste 200 different queries into the union. Once you've staged the data, the ETL process can read the whole lot from the view.
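To give a feel for the generation step, here is a rough sketch that builds the view from the per-source schemas; the `src_*` schema naming, the `Orders` table, and SQL Server 2017+ (for STRING_AGG) are all assumptions.

```sql
DECLARE @sql nvarchar(max);

-- One SELECT per source schema, tagged with the source name,
-- glued together with UNION ALL.
SELECT @sql = STRING_AGG(
    CAST('SELECT ''' + s.name + ''' AS DataSource, o.* FROM '
         + QUOTENAME(s.name) + '.Orders AS o' AS nvarchar(max)),
    CHAR(10) + 'UNION ALL' + CHAR(10))
FROM sys.schemas AS s
WHERE s.name LIKE 'src[_]%';          -- e.g. src_001 .. src_200

EXEC ('CREATE OR ALTER VIEW dbo.vw_StagedOrders AS' + CHAR(10) + @sql);
```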
This allows the ETL to run in one hit, although you will have to come up with a strategy to deal with extract failures from individual systems. For this, you might want to look into an architecture that deals with late arriving data gracefully, so you can catch up individual feeds that had transient issues.
BCP
For 200 simple extracts, BCP is probably a good way to go. The sources are all identical, so the BCP files will be the same across sources; a sketch of one extract/load pair is below. You can build a load controller with SSIS. Getting multiple threads to read the top off a common list would require you to implement synchronised access to the list: the SSIS process has a bunch of loops running in parallel in a sequence container that pop the next item, execute it, and update the corresponding status.
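As a rough illustration, assuming a `dbo.Orders` table and a UNC share for the extract files (both assumptions), each extract/load pair might look like:

```sql
-- On (or against) the source server, dump the table in native format:
--   bcp SourceDb.dbo.Orders out \\staging\extracts\src_042\Orders.dat -n -S SRV042 -T
-- On the staging server, load the file into that source's staging schema:
BULK INSERT src_042.Orders
FROM '\\staging\extracts\src_042\Orders.dat'
WITH (DATAFILETYPE = 'native', TABLOCK);
```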
The 'next' function can be implemented as a sproc running in a serializable transaction that pops the next eligible source off the list and marks it as 'in progress' within the transaction. This is a 'table as queue' problem, but you don't have to implement synchronised inserts - a whole batch can be pushed into the table at the start of the run.
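A minimal sketch of that 'next' sproc, assuming the queue table below (all names are illustrative):

```sql
CREATE TABLE dbo.ExtractQueue (
    SourceId int         NOT NULL PRIMARY KEY,
    Status   varchar(20) NOT NULL DEFAULT 'Pending'  -- Pending / InProgress / Done / Failed
);
GO
CREATE PROCEDURE dbo.GetNextSource
AS
BEGIN
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRAN;
        -- Claim the next pending source and flag it in one atomic step;
        -- the UPDLOCK hint stops two parallel loops claiming the same row.
        WITH next_source AS (
            SELECT TOP (1) SourceId, Status
            FROM   dbo.ExtractQueue WITH (UPDLOCK, ROWLOCK)
            WHERE  Status = 'Pending'
            ORDER  BY SourceId
        )
        UPDATE next_source
        SET    Status = 'InProgress'
        OUTPUT inserted.SourceId;   -- empty result set means the queue is drained
    COMMIT;
END;
```

Each parallel loop calls this until it gets an empty result, which tells that loop the run is complete.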
Structure the individual extract process so that it retries once or twice if the first attempt fails. This will mitigate a lot of failures caused by transient errors. Fail the task if the retries are exhausted, and structure the ETL so it is resilient to individual extraction failures.
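A hedged sketch of the retry wrapper, assuming a hypothetical `dbo.RunExtract` procedure:

```sql
DECLARE @SourceId int = 42, @attempt int = 1, @maxAttempts int = 3;

WHILE @attempt <= @maxAttempts
BEGIN
    BEGIN TRY
        EXEC dbo.RunExtract @SourceId = @SourceId;
        BREAK;                             -- success: stop retrying
    END TRY
    BEGIN CATCH
        IF @attempt = @maxAttempts THROW;  -- retries exhausted: fail the task
        SET @attempt += 1;
        WAITFOR DELAY '00:00:30';          -- back off before trying again
    END CATCH
END;
```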
Incremental loads
An incremental loader is probably not worth the bother for dimension tables unless you have a really big dimension that shows real performance issues. For the fact table data sources it probably is worth it. If you can add a row version to the application tables with a timestamp (rowversion) column or some such, you can pick up just what's new or changed. However, you will need to track the last version you extracted locally. If there is an insert or update date on the data, you may be able to use that instead.
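A sketch of the high-water-mark pattern this implies; the watermark table, `dbo.OrderLine`, and its `RowVersion` column are assumptions:

```sql
DECLARE @SourceId int = 42;
DECLARE @last binary(8) =
    (SELECT LastRowVersion FROM dbo.ExtractWatermark WHERE SourceId = @SourceId);

-- Pull only rows changed since the previous run.
SELECT OrderId, LineNumber, LineCost
FROM   dbo.OrderLine
WHERE  RowVersion > @last;

-- Record the new high-water mark for next time.
UPDATE dbo.ExtractWatermark
SET    LastRowVersion = (SELECT MAX(RowVersion) FROM dbo.OrderLine)
WHERE  SourceId = @SourceId;
```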
Full Loads
What could possibly go wrong?
200 processes kicking off to do a full load places a load spike on the network and possibly on the staging database. This could lead to all sorts of transient issues such as timeouts. For small dimension tables it's probably not such a big issue, but for 100 GB there is quite a wide variety of risks: WAN saturation, locking (although the right staging architecture will mitigate that), and availability of sources. The longer the extract process has to run, the more influence environmental factors have on its reliability.
There are quite a lot of imponderables here, so YMMV. I'd suggest an incremental load for the larger tables if possible.
If you want to de-normalize this into a single fact table, the fact table is going to be about line items. Therefore, the facts from DataSourceAHeader need to be split up and distributed to the relevant line items so they are not duplicated. As it is currently presented, that means dropping your total order cost and calculating this by summing the line item costs.
The DataSourceAHeader dimension keys (e.g. order date) can be taken from DataSourceAHeader and applied to the fact rows generated from DataSourceBLine. In the example there doesn't seem to be any info contained on DataSourceALine which isn't already included on either DataSourceAHeader or DataSourceBLine, but if there is this can be mapped across in a similar way.
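To make the mapping concrete, here is a hedged sketch of the line-grain fact query; all column names beyond the source table names are assumed.

```sql
-- Build fact rows at line grain, carrying the header's dimension keys down to
-- each line; the order total is then derived as SUM(LineCost), never stored.
SELECT  l.OrderId,
        l.LineNumber,
        h.OrderDateKey,        -- dimension key taken from the header
        l.LineCost
FROM    DataSourceBLine   AS l
JOIN    DataSourceAHeader AS h
          ON h.OrderId = l.OrderId;
```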
This approach relies on a number of assumptions, the key one being that all the facts from DataSourceAHeader can accurately be distributed among its constituent line items. If this isn't true, loading two separate fact tables (one for the order and one for the line items) might well be a better approach. The same might be true if there are a lot of questions to be asked about orders, which do not consider line item specific info. This is labelled as "Bad Idea #2" in the article which you've referenced, but I have found that in certain circumstances, it's actually a good idea.
Finally this assumes that the two data sources are in sync. If they're not, you'll be limiting yourself to loading data at the pace of the slower data source. This might be fine, but needs to be considered in the context of your needs and the difference between the two data sources.
Edit: De-normalizing into a single fact table may significantly impact performance when counting orders, as it's essentially a distinct count, which would be my main reason for considering two separate fact tables.
Edit 2 (in response to question edit):
Here, the issue is that at the most granular level (line) the data is incomplete, inasmuch as not all rows have a cost value. However, the total cost information is available at the next level up (header).
This presents the situation where you cannot derive the higher level from the lower; let’s consider the resulting options:
- Have a single fact table at the lowest granularity available (line). This is a non-starter, as we would be relying on the incomplete line data to answer questions at the higher level which we know could be answered in full from the header data.
- Have a single fact table at the higher granularity (header). This means we can now answer questions at the higher level with the complete data, but can no longer answer questions at the more granular level at all. This may be considered to be acceptable, but in most cases we are throwing away potentially valuable data.
- Have two related fact tables, one for the incomplete, more granular data (line) and one for the complete, less granular data (header). This is the ideal solution, as we can now answer questions at the higher level in full, and can give the best possible answer to questions at the lower level, given the incompleteness of the source data.
This question was raised because of doubts about having two related fact tables. The doubts stem from the fact that maintaining and joining two large fact tables can be resource intensive. That's true, and if your most granular information can be used to provide a full description of the situation then using a single fact table is preferable. However, in situations like this where that's not possible, two fact tables are required if you want to preserve as much information as possible.
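As a rough sketch of the two-table shape (the names and keys are assumptions):

```sql
CREATE TABLE dbo.FactOrderHeader (
    OrderId      int   NOT NULL PRIMARY KEY,   -- degenerate dimension
    OrderDateKey int   NOT NULL,
    TotalCost    money NOT NULL                -- complete at header grain
);

CREATE TABLE dbo.FactOrderLine (
    OrderId    int   NOT NULL REFERENCES dbo.FactOrderHeader (OrderId),
    LineNumber int   NOT NULL,
    LineCost   money NULL,                     -- incomplete at line grain
    PRIMARY KEY (OrderId, LineNumber)
);
```

This also addresses the distinct-count concern above: counting orders becomes a plain `COUNT(*)` against the header fact, rather than a `COUNT(DISTINCT OrderId)` against the line fact.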
The value of having a separate source database is that if your transformation process fails, you don't have to go to the external sources to pull the data again. You can simply fix the transformation issue and restart at the transformation step, skipping the source extract.
If the time required to pull the data from the external sources is insignificant, then it may not be worth the extra effort; you simply have to weigh the various factors to determine what's best for the project.
Using a source database could also speed up your testing, since you can test the load of the source database and the transformation processes separately.