Data into Star Schema’s fact or dimension table

business-intelligencedata-warehousereportingstar-schema

I'm working on creating a Star Schema for my internship that I want to use for reporting on the company's pre-sales process. The aspects of the pre-sales process that I want to report on, are:

  • Sales proposals;
  • Task management.

The thing that I'm having trouble with is whether or not to separate proposals and tasks from one another in my Star Schema. For this reason, I've now created two different ones and I wonder if any of them is the "correct" way of doing it.

Star Schema 1, in which the proposal data has been separated from the fact table and put in a dimension table:
Star Schema 1

Star Schema 2, in which the proposal data and task data have been combined in the fact table:
Star Schema 2

These are the metrics that I eventually want to be able to report:

  1. % of tasks finished on time
  2. Workload over time
  3. Due and completed proposals
  4. Number of active submissions
  5. Potential total sales
  6. Potential total gross profit
  7. Potential average gross profit % per sales team
  8. % of submissions converted to orders

Which of these schema's do you guys think is better? I've been trying to wrap my head around this for some while now, but I can't figure it out. Star Schema 1 has less redundancy, but does have values that I want to use in a dimensional table. Star Schema 2, on the other hand, has a lot of redundancy because proposals can have multiple tasks (sometimes up to 20).

Also, a more general question regarding my schema's:
– Do all relationships have to be visualised in the Star Schema Diagram? For example: all dates in every table must be linked to the date table. And: all references to employees must be linked to the employee table.

P.S.
The application and its database are still being developed so I have no data that I can use to test it.

Best Answer

This is a very interesting, but perhaps also a bit too general as a question.

My general opinion is that if the cardinality of a dimension starts to approach that of the fact table, then there's no point in keeping a separate dimension. If you read Kimball carefully you'll see that on many occasions there's an implicit assumption that dimensions are small when compared to the fact table (by a few orders of magnitude at least).

In your case, you mention a proposal may have up to 20 tasks; you're pretty much on the verge of having a dimension that starts to be too large to be efficiently handled.

However, there's another thing you may have missed. What you're trying to achieve is counting proposals on any given state of their lifecycle. Again, if you go to Kimball, the solution would be an accumulating snapshot fact table. But this comes at a cost, which is the need to define beforehand how many milestones of an object's lifecycle it is worth tracking.

There's an alternative, though, which I developed a while back and blogged about last week, which is what I call a Status Change Fact table, where you track events (in your case, tasks) and add rows to mark each state change in the proposal's lifecycle (blog post here). The basic idea is that if a proposal has two tasks, A and B, that took place on days D1 and D2, you insert three rows on your fact table:

date;task;count D1;A;+1 D2;A;-1 D2;B;+1

This model, combined with a periodic snapshot to count how many of which are in each state at any given time, should allow you to run simple queries and still get meaningful answers to your questions.

I've implemented this approach in a couple projects recently where objects are tracked through a system, with a good degree of customer satisfaction, even though customers had to be educated to what negative counts in the status change table mean.