Sql-server – Dimension Help – Deciding Fact or Dimension

database-design slowly-changing-dimension sql-server ssas

We have the following Dim and Facts:

Customer Dim: SCD Type 2, info about the customer such as first purchase date, name, address, etc.
Product Dim: SCD Type 2, about our products

Customer Snapshot Fact: Monthly Fin facts about the Customer
Product Sales Fact: Sales by Customer

and will have many more facts that involve the customer Dimension.

We have a legacy DB that has been collecting hundreds of data fields about each customer, and we have been asked to bring this data into the DW. There are 100-plus fields related to a customer, and many of them are flags that indicate whether the customer qualifies for something. The majority of the queries users will run against any of our fact tables may filter and/or group by these indicators.

The question is: should we add the 100-plus indicators to our Customer Dim? If not, how should the data be structured so that this information can be joined with all of our other facts?

Thanks for your help.

Best Answer

Put the indicators on the customer dimension.

Any fact table that joins to the customer dimension then has access to all of the indicators, so they are trivially available for filtering and grouping in queries against those facts.
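As a minimal sketch of what this looks like in practice, the snippet below uses Python's built-in sqlite3 module as a stand-in for SQL Server. The table names, the two indicator columns (`is_loyalty_member`, `is_credit_approved`) and the sample rows are all hypothetical; the point is only that a fact query can filter on one flag and group by another through the ordinary dimension join.

```python
import sqlite3

# In-memory database; all table and column names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key       INTEGER PRIMARY KEY,  -- surrogate key (one per SCD2 version)
    customer_id        TEXT,                 -- business key
    name               TEXT,
    is_loyalty_member  INTEGER,              -- hypothetical indicator
    is_credit_approved INTEGER               -- hypothetical indicator
);
CREATE TABLE fact_product_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    sales_amount REAL
);
INSERT INTO dim_customer VALUES
    (1, 'C001', 'Alice', 1, 1),
    (2, 'C002', 'Bob',   0, 1),
    (3, 'C003', 'Carol', 1, 0);
INSERT INTO fact_product_sales VALUES
    (1, 100.0), (1, 50.0), (2, 75.0), (3, 20.0);
""")

# Filter on one indicator and group by another, using only the dimension join.
sales_by_flag = conn.execute("""
    SELECT d.is_loyalty_member, SUM(f.sales_amount)
    FROM fact_product_sales AS f
    INNER JOIN dim_customer AS d
        ON f.customer_key = d.customer_key
    WHERE d.is_credit_approved = 1
    GROUP BY d.is_loyalty_member
    ORDER BY d.is_loyalty_member
""").fetchall()
print(sales_by_flag)
```

Any other fact table that carries `customer_key` could be queried the same way without further modelling work.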

If you just need a fact table with counts of customers that roll up by the dimension attributes, then you can create a 'factless fact table' that just has a single fact - a 'QTY' column with a value of 1. This allows counts of customers to be grouped by any of the attributes.

If Customer is a slowly changing dimension, consider putting an additional row with -1 in the QTY column into the fact table every time a type 2 change is made. This should link to the previous version of the dimension, with the current version having an additional row with a 'QTY' of 1. This allows you to track statistics on changes in customer attributes over time.

A cube can consume this change over time for the counts by implementing a calculated measure that does a running sum of 'QTY' from the beginning up to the selected date. If necessary, you could also build a snapshot table.
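The +1/-1 bookkeeping and the running sum can be sketched roughly as follows, again with sqlite3 standing in for SQL Server and plain SQL standing in for the cube's calculated measure. The `segment` attribute, the `fact_customer_count` table, the `change_date` column and the helper `customer_counts_as_of` are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,  -- one row per type 2 version
    customer_id  TEXT,                 -- business key
    segment      TEXT                  -- hypothetical tracked attribute
);
CREATE TABLE fact_customer_count (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    change_date  TEXT,                 -- ISO date, so string comparison sorts correctly
    qty          INTEGER               -- +1 when a version starts, -1 when it is superseded
);
-- C001 starts as Bronze, then becomes Gold on 2012-03-01; C002 joins as Bronze.
INSERT INTO dim_customer VALUES
    (1, 'C001', 'Bronze'),
    (2, 'C001', 'Gold'),
    (3, 'C002', 'Bronze');
INSERT INTO fact_customer_count VALUES
    (1, '2012-01-01',  1),
    (3, '2012-02-01',  1),
    (1, '2012-03-01', -1),  -- old version of C001 is retired
    (2, '2012-03-01',  1);  -- new version of C001 is counted
""")

def customer_counts_as_of(as_of_date):
    """Sum QTY from the beginning up to as_of_date, grouped by segment."""
    return conn.execute("""
        SELECT d.segment, SUM(f.qty)
        FROM fact_customer_count AS f
        INNER JOIN dim_customer AS d
            ON f.customer_key = d.customer_key
        WHERE f.change_date <= ?
        GROUP BY d.segment
        HAVING SUM(f.qty) <> 0
        ORDER BY d.segment
    """, (as_of_date,)).fetchall()

print(customer_counts_as_of("2012-02-15"))
print(customer_counts_as_of("2012-03-15"))
```

Because each superseded version is cancelled by its -1 row, the sum at any date counts each customer once, under whichever attribute values were current at that date.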

If the number of attributes and volume of changes becomes unwieldy then you have what Kimball calls a 'Rapidly Changing Monster Dimension' in his first book. In this case, consider pulling the attributes out into a separate junk dimension that has a row for each distinct combination of values. You will still want some scaffolding to link this to the actual dimension rows so you can copy the junk dimension key onto any fact tables you have the customer dimension key on.
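One way the junk dimension and its scaffolding might be built is sketched below, with sqlite3 standing in for SQL Server. The staging tables, the `dim_customer_flags` and `map_customer_flags` names, and the two flag columns are hypothetical; a real load would cover all of the indicators and handle changes over time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical staging data: current flag values per customer dimension key.
CREATE TABLE stg_customer_flags (
    customer_key       INTEGER,
    is_loyalty_member  INTEGER,
    is_credit_approved INTEGER
);
INSERT INTO stg_customer_flags VALUES
    (1, 1, 1), (2, 0, 1), (3, 1, 1), (4, 0, 1);

-- Junk dimension: one row per distinct combination of flag values.
CREATE TABLE dim_customer_flags (
    flags_key          INTEGER PRIMARY KEY AUTOINCREMENT,
    is_loyalty_member  INTEGER,
    is_credit_approved INTEGER,
    UNIQUE (is_loyalty_member, is_credit_approved)
);
INSERT INTO dim_customer_flags (is_loyalty_member, is_credit_approved)
SELECT DISTINCT is_loyalty_member, is_credit_approved
FROM stg_customer_flags;

-- Scaffolding: which junk row applies to which customer dimension row.
CREATE TABLE map_customer_flags AS
SELECT s.customer_key, j.flags_key
FROM stg_customer_flags AS s
INNER JOIN dim_customer_flags AS j
    ON  s.is_loyalty_member  = j.is_loyalty_member
    AND s.is_credit_approved = j.is_credit_approved;

-- Fact load: copy the junk key onto the fact next to the customer key.
CREATE TABLE stg_sales (customer_key INTEGER, sales_amount REAL);
INSERT INTO stg_sales VALUES (1, 100.0), (2, 75.0), (3, 20.0), (4, 5.0);

CREATE TABLE fact_product_sales AS
SELECT s.customer_key, m.flags_key, s.sales_amount
FROM stg_sales AS s
INNER JOIN map_customer_flags AS m
    ON s.customer_key = m.customer_key;
""")

junk_rows = conn.execute("SELECT COUNT(*) FROM dim_customer_flags").fetchone()[0]

# Queries on the flags now join to the small junk dimension only.
sales_by_loyalty = conn.execute("""
    SELECT j.is_loyalty_member, SUM(f.sales_amount)
    FROM fact_product_sales AS f
    INNER JOIN dim_customer_flags AS j
        ON f.flags_key = j.flags_key
    GROUP BY j.is_loyalty_member
    ORDER BY j.is_loyalty_member
""").fetchall()
print(junk_rows, sales_by_loyalty)
```

Four customers collapse to two junk rows here; with many flags the saving depends on how many distinct combinations actually occur in the data.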

A cube can consume either structure, so the choice is less an issue for the cube than for the ETL processing.

Related Solutions

Should I Link a Fact to Hierarchical Dimension at ALL Levels or only the most granular

I would link at all/most levels. This denormalized star means that yes, the data is redundant, but it typically makes the reporting and analysis a lot easier. Note that this is very different from OLTP normalization, and you don't typically have to worry about redundant data getting out of sync because in a DW scenario data never changes. New facts get added and dimensions get expired and new ones created.

I don't see a Dim_Folder. I would assume that the actual path of the folder would be an attribute of the Dim_Folder. Only the numeric quantity and any degenerate dimensions (http://en.wikipedia.org/wiki/Degenerate_dimension) would be in the fact table. I wouldn't think of the folder path as a degenerate dimension because it keeps coming back in each snapshot (and a folder isn't a transaction).

So you could do something like this:

SELECT AVG(bytes_on_disk)
FROM FACT_Folder
INNER JOIN DIM_Folder
    ON FACT_Folder.FolderDimID = DIM_Folder.DimID
INNER JOIN DIM_Date
    ON FACT_Folder.SnapshotDateID = DIM_Date.DateID
WHERE DIM_Date.Date BETWEEN '20120101' AND '20121231'
    AND DIM_Folder.FolderPath = '/usr/bin/'

See how filtering on DIM_Folder keeps the set of dim IDs small; beyond that, we're assuming some kind of index on snapshot date and then folder dim ID (or vice versa).

See how you also now don't need to join on folder at all if you just want the data at a higher level. Since you usually know all this at ETL time, there is a different motivation than in OLTP systems where you want everything to move together when something is changed (leg bone connected to the thigh bone, etc.). In DW scenario, you really don't want anything to move.

So, bam! - total Farm usage analysis:

SELECT DIM_Farm.Farm_Name, SUM(bytes_on_disk)
FROM FACT_Folder
INNER JOIN DIM_Farm
    ON FACT_Folder.FarmDimID = DIM_Farm.DimID
INNER JOIN DIM_Date
    ON FACT_Folder.SnapshotDateID = DIM_Date.DateID
WHERE DIM_Date.Date BETWEEN '20120101' AND '20121231'
GROUP BY DIM_Farm.Farm_Name

Remember stars are really simple for analysis. You NEVER need to worry about inadvertent cross joins in a single non-snowflaked star. When linking different stars, you DO have to watch out. So queries in MOST cases are MUCH simpler in star-schemas. No network traversal and worrying about many-many relationships like in a normalized model.

Should I snowflake or duplicate it across the facts

The "header - detail" pattern is very common in the domain of sales transactions.

To answer your question, there are so many factors that will come into play which you've not discussed. For example:

If your DW infrastructure has a great deal of RAM and is on SSD storage, reads in this case are cheap, so it might make sense to denormalize some dimensions in the interest of usability.
What are the use-cases of the data? In this case I can probably make assumptions - it is sales data. It'll be used for accounting, executive reporting, predictive analysis, customer service, and for just about every possible ad-hoc query you can imagine.

One general principle I use when deciding whether to snowflake a dimension or simply include its value in the fact table is this:

If the dimension has many attributes which might be useful for reporting (or if there will be report(s) solely on that dimension), I create a dimension for it.

Example: Consider the CUSTOMER dimension. A sales order has a customer, but there are other attributes that belong with the CUSTOMER dimension which you might want to report on, like customer location, customer age/sex/marital status, customer type, customer create date, and many other customer-related attributes. I wouldn't put all of these in a fact table, so in this case I "snowflake" to a customer dimension, as there are many more attributes related to CUSTOMER which might be relevant to your sales fact data. There would likely also be reports that rely solely on the CUSTOMER dimension - like a "new customer by month" report. You wouldn't expect this to be in the fact data. The PRODUCT dimension is another I would almost always put in its own dimension.

If the dimension is a single value with no other useful attributes connected to it, I may consider it for inclusion in a fact table.

Example: We might have an attribute called "Order Source Channel" - which might be a single value describing where the sales order came from e.g. it might have values like eCommerce, Kiosk, Point-of-Sale, Phone-In, etc. It's a single value, and no other related attributes exist for this entity. In this case, I am tempted to leave it in the fact table, rather than create a single-attribute dimension and require my users to do an additional join, etc.
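A small sketch of that choice, using sqlite3 as a stand-in for SQL Server: the channel is stored as a plain column on the fact table, so reporting by channel needs no join at all. The `fact_sales_order` table, its columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales_order (
    order_number         TEXT,
    customer_key         INTEGER,  -- still joins to a full customer dimension
    order_source_channel TEXT,     -- single-valued attribute kept on the fact
    sales_amount         REAL
);
INSERT INTO fact_sales_order VALUES
    ('SO1', 1, 'eCommerce', 100.0),
    ('SO2', 2, 'Kiosk',      40.0),
    ('SO3', 1, 'eCommerce',  60.0),
    ('SO4', 3, 'Phone-In',   25.0);
""")

# No dimension join is required to report by channel.
sales_by_channel = conn.execute("""
    SELECT order_source_channel, SUM(sales_amount)
    FROM fact_sales_order
    GROUP BY order_source_channel
""").fetchall()
print(dict(sales_by_channel))
```

The trade-off is that the text value is repeated on every fact row, which is usually acceptable for a short, low-cardinality attribute like this one.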

Remember:

The above is a generality, and I don't treat it as a hard-and-fast rule.
Data modeling is as much an art as it is a science. There are many scenarios which can go either way, and only experience will help you decide which way to go.
Usability of your DWH structure should be as much a consideration as performance. I try never to create a data model that requires my power users to write SQL queries with 15 or more joins just to get at sales data. This will lead to someone writing incorrect SQL (it always does), and the wrong results will sometimes be mistaken for "bad data". This is what you, as a DWH developer, don't want happening.
