Sql-server – Data warehouse staging architecture

Architecturedata-warehousesql serversql-server-2012

This is a question about data warehouse design. We are setting up a healthcare datawarehouse and starting with 2 major source systems that combine for about 20,000 tables and 2 TB of data. 1) It is highly dimensional data 2) We don't wan't to heavily effect OLTP systems

We have chosen an incremental Kimball design. My question is, should all of the data be staged, then sorted into inserts/updates and put into the data warehouse. Then the staging data would be cleared for the next incremental load.

This leaves you with 1 copy of the data.

The other method would be to incrementally load it into staging, sort it into inserts/updates and store it in the same format as the source systems. Then we would combine data from the source systems into the datawarehouse from the full copy.

This would essentially leave you 2 copies of the data, one in the form of the source systems and 1 loaded into the actual datawarehouse.

What is the best practice for this? I originally thought it would be best to only store the copy in the data warehouse and clear the source tables each load.

However, in that case if you ever have to go back to an existing dimension and add a column you would have to re-load all the dependent source tables. Plus you would lose history?

It just seems really inefficient to store it twice though….just wanted some thoughts on the design, your experiences and best practice.

Best Answer

Personally i have staging tables for extract, transform and persistent data storage.

Whether you do full exports or incremental loads will depend on what tools you have, your strategy and whether your app schema and data support it. Sometimes you cant avoid full exports.

Adding a column to a dimension isnt a big deal, but backfilling historic data could be very difficult or may not be possible at all. Trying to reconstruct how an app looked at a point in time retrospectively would be a major undertaking. You would need a very good case to justify that.

All of the things you mention are possible, but only you can decide if the cost/benefit is worth it.