I have an accumulating snapshot fact table that tracks the entry and exit of containers in a terminal.
The containers can enter and exit in 3 different ways, so I thought to create a specific dimension table that lists these 3 possible ways (train, vessel or truck).
Then I read this article which basically says that this technique is wrong, but I can't understand why.
First article:
Sometimes when a fact table has a long list of facts that is sparsely populated in any individual row, it is tempting to create a measure
type dimension that collapses the fact table row down to a single
generic fact identified by the measure type dimension. We generally do
not recommend this approach. Although it removes all the empty fact
columns, it multiplies the size of the fact table by the average
number of occupied columns in each row, and it makes intra-column
computations much more difficult. This technique is acceptable when
the number of potential facts is extreme (in the hundreds), but less
than a handful would be applicable to any given fact table row.
I understand that if a "Measure Type Dimension" is implemented for a transaction fact table it can create problems like this other article says, but I can't see any downside if used for an accumulating snapshot fact.
Second article: (some downsides of implementing a "Measure Type Dimension")
- […] If we go with a "Measure Type Dimension" we will lose this analytic ability. If one measure is not compatible with the other
measures, we can’t add them up.- […] The more number of passes our SQL need to run to produce a report, the slower the report.
- […] On the BI tool, if you don’t put the measure type filter, you are risking the user getting “rubbish information”. From usability
point of view, this design is a rubbish.
Response to Mark Storey-Smith's answer
Very nice approach, i would have never thought about that.
Another thing: every entry and exit of a vehicle that brings container into the terminal has a unique ID which gives me other informations like: expected arrival of the vehicle, actual arrive, if it's a vessel the dock, if it's a truck the tollbooth and many other informations…
These are 3 different fact tables and they must be linked somehow to the container fact table.
I thought that the ID of the voyage is a degenerate dimension
, so it would directly go into the container fact table. So, my doubt is: should i add 6 different fields in the container fact table (vessel_voyage_in_key, vessel_voyage_out_key, train_voyage_in_key, train_voyage_out_key, truck_voyage_in_key, truck_voyage_out_key) or just 2 other fields (voyage_in, voyage_out) that dynamically links to the various voyage tables?
I hope my doubt is clear, thank you.
Best Answer
I believe the guidance is referring to a wide fact table where the majority of measure values are null:
The suggestion is that some people will see all the nulls and decide to do this instead:
Not good.
In your scenario I think I'd be looking at something like this, which is very different to the scenario described in the articles you referenced.
To the additional questions...
I would add
ExpectedEntryDate
,ExpectedExitDate
to theContainer/InventoryFact
. Less certain, without visibility of all the data elements, I would probably putEntryVoyageId
andExitVoyageId
in a separate junk dimension together as one row along with any other degenerate data items (identifiers for the truck, train etc).I would add 3 new dimensions for
VesselVoyage
,TruckVoyage
andTrainVoyage
and 6 Voyage keys (inbound/outbound) to this one fact (it's 6 new keys, not 6 additional rows). You then have the option of placingDock
andTollbooth
in the appropriate Voyage dimension. If you keep the generic data in these dimensions (VesselFlag
,TruckCapacity
) and the specific in a junk dimension (VesselName
,VesselMMSI
) they won't explode in size.