SQL Server – Designing a dimensional DB off a normalized source that already implements SCDs

dimensional-modeling | slowly-changing-dimension | sql-server | ssas | ssis-2014

I have built an SSIS ETL to integrate various data sources (one from MySQL, two from SQL Server) into a single SQL Server relational and normalized database, which I've called [NDS].

The SSIS ETL handles type-2 updates: the [NDS] generates surrogate keys, the SCD tables include an [_EffectiveFrom] timestamp and a nullable [_EffectiveTo] column, and there are constraints on the natural keys and beautiful foreign keys linking all the data together.
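As a rough illustration of the shape of these tables (simplified, hypothetical names, not my actual schema):

    -- Hypothetical [NDS] type-2 table: surrogate key, natural key constraint,
    -- and the [_EffectiveFrom]/[_EffectiveTo] validity window (NULL = current row).
    CREATE TABLE nds.Suppliers (
        SupplierKey      INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
        SupplierCode     NVARCHAR(20)      NOT NULL,             -- natural key from the source
        SupplierName     NVARCHAR(100)     NOT NULL,
        PaymentTerms     NVARCHAR(50)      NULL,                 -- a type-2 tracked attribute
        [_EffectiveFrom] DATETIME2(3)      NOT NULL,
        [_EffectiveTo]   DATETIME2(3)      NULL,
        CONSTRAINT UQ_Suppliers_Natural UNIQUE (SupplierCode, [_EffectiveFrom])
    );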

Now, I wanted to build an SSAS dimensional database off of it, and it didn't take too long before I realized I was setting myself up for a snowflake schema:

[yUML diagram depicting a sample of the table relationships]

So I'm thinking of adding a new [DDS] (relational) database to hold the actual dimension and fact tables that will feed the DSVs for the SSAS database.

This [DDS] database would be as denormalized as humanly possible, so as to "flatten" the facts and dimensions (e.g. [OrderHeaders]+[OrderDetails] into an [Orders] fact table, and [CustomerStores]+[Customers]+[SalesReps] into some [Customers] dimension table). Doing this should not only make it easier for me to build the dimension hierarchies in SSAS, it should also make it easier to come up with an actual star schema.
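To make the flattening concrete, this is the kind of load I have in mind for the fact table (column and schema names are hypothetical, just to show the shape):

    -- Hypothetical flatten of [OrderHeaders]+[OrderDetails] into a single [Orders] fact,
    -- one row per order line; object and column names are illustrative only.
    INSERT INTO dds.Orders (OrderNumber, OrderDate, CustomerKey, ProductKey, Quantity, LineAmount)
    SELECT  oh.OrderNumber,
            oh.OrderDate,
            oh.CustomerKey,
            od.ProductKey,
            od.Quantity,
            od.Quantity * od.UnitPrice
    FROM    nds.OrderHeaders AS oh
    JOIN    nds.OrderDetails AS od
            ON od.OrderHeaderKey = oh.OrderHeaderKey;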

I have a few questions though:

  • Can I reuse a subset of my existing surrogate keys? I'm thinking of taking the existing key at the most granular level and making it the dimension key. Is that a good approach, or should I ignore the [NDS] surrogate keys and have the [DDS] (relational db) generate a new set of surrogate keys?
  • How do I handle SCDs? For example, "Materials" and "Suppliers" will generate new records in the [NDS] when specific fields change in the source system… I think I'll have to design the SSIS ETL to load only the "last image" records into the [DDS] (see the sketch after this list), and then re-implement the type-2 updates in that process, i.e. treat the [NDS] as a "source system" that keeps history while duplicating everything in the [DDS] database. But then, why would I need to keep history in both the [NDS] and the [DDS]? Clearly something's not right.
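For reference, the "last image" extraction in the second bullet would essentially filter on the open-ended validity window (hypothetical names apart from the [_EffectiveTo] convention):

    -- Pull only the current ("last image") rows from the [NDS] for the [DDS] load;
    -- the NULL [_EffectiveTo] convention is real, the object/column names are illustrative.
    SELECT  s.SupplierKey,
            s.SupplierCode,
            s.SupplierName,
            s.PaymentTerms
    FROM    nds.Suppliers AS s
    WHERE   s.[_EffectiveTo] IS NULL;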

Am I setting myself up for a Big Mess™, or am I on the right track?

Best Answer

As you are seeing, one of the benefits of Kimball Dimensional Modeling is that the data warehouse design is essentially your SSAS design. While there are always exceptions, you can typically select a table in the DSV and immediately move to hierarchy design, cube relationships etc.

I'd recommend moving to the DDS, with the caveat that you phase out the NDS, for the simple reason you mention with the SCD Type II: there's no reason to duplicate all of that data, ETL, and code base. Keeping both leads to an overly complex solution with a lot of maintenance and risk - this is the primary Big Mess™ you need to avoid.

Here's the justification:

  • The DDS sounds like a Kimball dimensional design that will be easily understood and maintained by any new hires and contractors
  • There are lots of proven designs for schema and ETL that go along with this, making it easier to design and maintain
  • Easier to deal with large data quantities via options like fact partitioning and column store indexes
  • Easier DW queries in the future for yourself and power users
  • Simpler overall design

I'd suggest the alternative design only if you have small quantities of data, no significant expectation of growth, and no capacity to switch to the data warehouse. This suggestion ultimately includes a lot of the design work that would otherwise go into the DW, and as such is similar to keeping the NDS and creating the DDS:

You can create your cubes with queries manifested as Views or hard-coded into the DSV.
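As a sketch of that approach (all object names hypothetical), such a view would simply flatten the NDS tables so the DSV can consume them as a dimension:

    -- Hypothetical view over the NDS that the DSV could reference as a flattened Customer dimension;
    -- all table and column names are illustrative.
    CREATE VIEW dbo.vw_DimCustomer
    AS
    SELECT  c.CustomerKey,
            c.CustomerName,
            cs.StoreName,
            sr.SalesRepName
    FROM    nds.Customers      AS c
    JOIN    nds.CustomerStores AS cs ON cs.CustomerKey = c.CustomerKey
    JOIN    nds.SalesReps      AS sr ON sr.SalesRepKey = c.SalesRepKey;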

It sounds like you overcame a lot of hurdles and created a great, functional solution with the NDS. Unfortunately, it doesn't provide two important things that dimensional modeling does: simple query patterns and easy translation into multidimensional analytics. Thankfully, a lot of the ETL design is likely useful as a template or starting point for loading a different table structure.