Should I snowflake or duplicate it across the facts

data-warehousedatabase-designolapssasstar-schema

I'm building up a data warehouse to be used by SSAS to create cubes on, and I'm debating between two possible schemas.
In my warehouse, I've got two different fact tables that tracking daily changes in dollar values. Each entity in these fact tables have an underlying Sales Order and Line to which they relate. These SOs and Lines then have other related dimensions, such as customer, product, etc. About 12 sub-dimensions total so far.

My question is if I should be rolling all these sub dimensions up directly into the fact tables, or if I should use a little snowflaking in my warehouse, and have them branching off the Sales Order and Lines dimension instead.

The first option obviously follows a star-schema model better. However, if changes are made such as adding additional dimensions, it becomes more maintenance, basically having to do the ETL twice for each fact table, rather than just the once on the SO dimension. As well, if a new fact is added that relates to Sales Orders, I'd have to go through the whole process again.

As this is my first DW/OLAP project, I'm not familiar on where the line should be drawn on snowflaking, and other people's thoughts would be highly appreciated.

Best Answer

The "header - detail" pattern is very common the domain of sales transactions.

To answer your question, there are so many factors that will come into play which you've not discussed. For example:

  • If your DW infrastructure has a great deal of RAM and is on SSD storage, reads in this case are cheap, so it might make sense to denormalize some dimensions in the interest of usability.
  • What are the use-cases of the data? In this case I can probably make assumptions - it is sales data. It'll be used for accounting, executive reporting, predictive analysis, customer service, and for just about every possible ad-hoc query you can imagine.

One general principle I use when deciding whether to snowflake a dimension or simply include it's value in the fact table is this:

  1. If the dimension has many attributes which might be useful for reporting (or if there will be report(s) solely on that dimension), I create a dimension for it.

Example: Consider the CUSTOMER dimension. A sales order has a customer, but there are other attributes which belong with the CUSTOMER dimension which you might want to report on, like customer location, customer age/sex/marital status, customer type, customer create date, etc., and many other customer-related attributes. I wouldn't put all of these in a fact table, so in this case I "snowflake" to a customer dimension as there are many more attributes related to CUSTOMER which might be relevant to your sales fact data. There would likely also be reports that solely rely on the CUSTOMER dimension - like a "new customer by month" report. You wouldn't expect this to be in the fact data. The PRODUCT dimension is another I would almost always put in it's own dimension.

  1. If the dimension is a single value with no other useful attributes connected to it, I may consider it for inclusion in a fact table.

Example: We might have an attribute called "Order Source Channel" - which might be a single value describing where the sales order came from e.g. it might have values like eCommerce, Kiosk, Point-of-Sale, Phone-In, etc. It's a single value, and no other related attributes exist for this entity. In this case, I am tempted to leave it in the fact table, rather than create a single-attribute dimension and require my users to do an additional join, etc.


Remember:

  • The above is a generality, and I don't treat is as a hard-fast rule
  • Data modeling is as much an art as it is a science. There are many scenarios which can go either way, and only experience will help you decide which way to go.
  • Usability of your DWH structure should be as much a consideration as is performance. I try to never create a data model that requires my power-users to write SQL queries with 15+ or more joins just to get at sales data. This will lead to someone writing incorrect SQL (it always does). This will sometimes be mistaken as "bad data". This is what you, as a DWH developer, don't want happening.