Dealing with different types in a fact table

data-warehousedatabase-designdimensional-modelingolapstar-schema

I'm working with dimensional modelling for the first time, trying to build a data warehouse that pulls in data from an OLTP database, and I'm having some trouble figuring out what to do with this type of scenario. The database tracks donations into a foundation. I'd like to have a fact table such that the "grain" is a single donation, i.e. at the "transaction" level. Donations can come from different types of donors. Let's say for example they come from individuals and companies, although in practice there are actually four relevant categories. The source data has tables for each of these donor types–so there is an individuals table and a companies table. Individuals and companies have very different attributes, and a typical reporting requirement will be to look at aggregate donations for all types, and then do deeper dives into donations by donor type. My first thought was to build a schema like so

(the colour is meaningless btw, I just happened to take the screenshot when the highlighted table was selected)

Since companies and individuals have such different attributes, the idea was to keep them in their own tables and have them be dimensions. (Remember that there are actually 4+ types like this in the real case).

But this is apparently not a good way to go because that means that there will be lots of NULL foreign keys in the fact table. I have read that this is a no-no (although I'm not sure why).

Another option would be to combine the Companies and Users tables into one big Donors table, which might look like this

This eliminates the NULL foreign keys from the fact table at the expense of simplicity in the dimension table. The Donor dimension table is now very wide, and contains a lot of null attributes (again, remember that there are 4+ types of donors that would be represented, each with its own unique attributes).

Finally, there is the option of having a Donor dimension that references Company and Individual tables. Here's how that might look:

Now all the null foreign keys are in the donor dimension table. Obviously, however, this is no longer a star schema.

I've been searching everywhere I can think of to find some general principles for dealing with this type of situation, but I'm coming up short. I'm just wondering if there is a standard way to deal with this since it seems like it ought to be a standard problem in data warehousing.

Best Answer

I would steer clear of the 3 option. It's called a snowflake instead of a star schema. It's a rather advanced solution which can be used when necessary but has it's own drawbacks. As usual refer to Kimball when looking for datawarehouse design tips.

This is what they say in Snowflakes, Outriggers, and Bridges:

We generally encourage you to handle many-to-one hierarchical relationships in a single dimension table rather than snowflaking. Snowflakes may appear optimal to an experienced OLTP data modeler, but they’re suboptimal for DW/BI query performance. The linked snowflaked tables create complexity and confusion for users directly exposed to the table structures; even if users are buffered from the tables, snowflaking increases complexity

When it comes to 1 or 2, I would say it's up to the reporting requirements and how well the dimension members (the records in the dimension) should be seen as one "axis" to report against. There can be benefits to combining multiple types of records in a single dimension (such as defining hierarchies and attribute relationships to improve performance) but in this case I would lean towards seperate dimensions for different entities.
Again, Kimball explains this in the basics of dimensional modeling but they have more articles on the subject.

Rule #1: Load detailed atomic data into dimensional structures.

Rule #2: Structure dimensional models around business processes.

Rule #8: Make certain that dimension tables use a surrogate key.

When it comes to NULL values in your fact table foreign keys, those should be solved by using surrogate keys which you should use anyway.

I answered a related question here but it basically comes down again to Selecting Default Values for Nulls which explains how you should handle missing data in your dimensions.

The first scenario where nulls should be avoided is when we encounter a null value as a foreign key for a fact table row during the ETL process. We must do something in this case because an actual null value in a foreign key field of a fact table will violate referential integrity;

In some way you can interpret that as "NULLS are a no-no" but it actually means "insert a dummy record such as N/A in your dimension, and use the surrogate key of that record in your fact tables foreign key field"

Best Answer

Related Solutions

Fact table foreign keys null

Sql-server – Dealing with measure groups of different granularity

Related Question