The "header - detail" pattern is very common the domain of sales transactions.
To answer your question, there are so many factors that will come into play which you've not discussed. For example:
- If your DW infrastructure has a great deal of RAM and is on SSD storage, reads in this case are cheap, so it might make sense to denormalize some dimensions in the interest of usability.
- What are the use-cases of the data? In this case I can probably make assumptions - it is sales data. It'll be used for accounting, executive reporting, predictive analysis, customer service, and for just about every possible ad-hoc query you can imagine.
One general principle I use when deciding whether to snowflake a dimension or simply include it's value in the fact table is this:
- If the dimension has many attributes which might be useful for reporting (or if there will be report(s) solely on that dimension), I create a dimension for it.
Example: Consider the CUSTOMER
dimension. A sales order has a customer, but there are other attributes which belong with the CUSTOMER
dimension which you might want to report on, like customer location, customer age/sex/marital status, customer type, customer create date, etc., and many other customer-related attributes. I wouldn't put all of these in a fact table, so in this case I "snowflake" to a customer dimension as there are many more attributes related to CUSTOMER
which might be relevant to your sales fact data. There would likely also be reports that solely rely on the CUSTOMER
dimension - like a "new customer by month" report. You wouldn't expect this to be in the fact data. The PRODUCT
dimension is another I would almost always put in it's own dimension.
- If the dimension is a single value with no other useful attributes connected to it, I may consider it for inclusion in a fact table.
Example: We might have an attribute called "Order Source Channel" - which might be a single value describing where the sales order came from e.g. it might have values like eCommerce
, Kiosk
, Point-of-Sale
, Phone-In
, etc. It's a single value, and no other related attributes exist for this entity. In this case, I am tempted to leave it in the fact table, rather than create a single-attribute dimension and require my users to do an additional join, etc.
Remember:
- The above is a generality, and I don't treat is as a hard-fast rule
- Data modeling is as much an art as it is a science. There are many scenarios which can go either way, and only experience will help you decide which way to go.
- Usability of your DWH structure should be as much a consideration as is performance. I try to never create a data model that requires my power-users to write SQL queries with 15+ or more joins just to get at sales data. This will lead to someone writing incorrect SQL (it always does). This will sometimes be mistaken as "bad data". This is what you, as a DWH developer, don't want happening.
This is a very interesting, but perhaps also a bit too general as a question.
My general opinion is that if the cardinality of a dimension starts to approach that of the fact table, then there's no point in keeping a separate dimension. If you read Kimball carefully you'll see that on many occasions there's an implicit assumption that dimensions are small when compared to the fact table (by a few orders of magnitude at least).
In your case, you mention a proposal may have up to 20 tasks; you're pretty much on the verge of having a dimension that starts to be too large to be efficiently handled.
However, there's another thing you may have missed. What you're trying to achieve is counting proposals on any given state of their lifecycle. Again, if you go to Kimball, the solution would be an accumulating snapshot fact table. But this comes at a cost, which is the need to define beforehand how many milestones of an object's lifecycle it is worth tracking.
There's an alternative, though, which I developed a while back and blogged about last week, which is what I call a Status Change Fact table, where you track events (in your case, tasks) and add rows to mark each state change in the proposal's lifecycle (blog post here). The basic idea is that if a proposal has two tasks, A and B, that took place on days D1 and D2, you insert three rows on your fact table:
date;task;count
D1;A;+1
D2;A;-1
D2;B;+1
This model, combined with a periodic snapshot to count how many of which are in each state at any given time, should allow you to run simple queries and still get meaningful answers to your questions.
I've implemented this approach in a couple projects recently where objects are tracked through a system, with a good degree of customer satisfaction, even though customers had to be educated to what negative counts in the status change table mean.
Best Answer
Ok, based on your very limited documentation, I would do the following:
Fact Tables - Your fact table is your measurement table. It is the thing that happened. It is the meeting together of dimension tables, typically at a point in time. In your diagram,
Dimension Tables - Dimension tables contain the textual context associated with a measurement event. It describes the “who, what, where, when, how, and why.” Based on your diagram:
Personally, I think you need to read up on Data Warehousing. The Data Warehouse Toolkit by Kimball is an invaluable resource.