Related question about the same project here.
I'm looking at two approaches to model a hierarchical relationship between my fact and dimension tables in a data warehouse for storing IT infrastructure data.
Consider as an example:
FACT_Folder
- Contains facts about a file system folder
Dim_Drive
- A folder links to a single DRIVE entry
Dim_Array
- A drive links to a single array
Dim_Server
- An array links to a single server
Dim_Farm
- A Server links to a single Farm
For our purposes we don't need to worry about a folder moving to a different drive.
We want to be able to get metrics at all the levels of this hierarchy.
Assuming I will eventually have millions of Folder entities (with daily data), and hundreds or thousands of drives, arrays, etc; which design pattern would you recommend and why?
-
Normalized
- A
Folder
links to aDrive
, and the rest of the hierarchy is between dimensions
- A
-
Denormalized I
- A
Folder
links to aDrive
, but theDrive
dimension contains the whole hierarchy for eachDrive
entry for every row
- A
-
Denormalized II
- A
Folder
links directly to all levels of the hierarchy
- A
-
??????
Best Answer
I would link at all/most levels. This denormalized star means that yes, the data is redundant, but it typically makes the reporting and analysis a lot easier. Note that this is very different from OLTP normalization, and you don't typically have to worry about redundant data getting out of sync because in a DW scenario data never changes. New facts get added and dimensions get expired and new ones created.
I don't see a Dim_Folder. I would assume that the actual path of the folder would be an attribute of the Dim_Folder. Only the numeric quantity and any degenerate dimensions (http://en.wikipedia.org/wiki/Degenerate_dimension) would be in the fact table. I wouldn't think of the folder path as a degenerate dimension because it keeps coming back in each snapshot (an a folder isn't a transaction).
So you could do something like this:
See how the DIM_Folder usage makes the set of dim ids small and then, we're assuming some kind of index on snapshot date and then folder dim id (or vice versa).
See how you also now don't need to join on folder at all if you just want the data at a higher level. Since you usually know all this at ETL time, there is a different motivation than in OLTP systems where you want everything to move together when something is changed (leg bone connected to the thigh bone, etc.). In DW scenario, you really don't want anything to move.
So, bam! - total Farm usage analysis:
Remember stars are really simple for analysis. You NEVER need to worry about inadvertent cross joins in a single non-snowflaked star. When linking different stars, you DO have to watch out. So queries in MOST cases are MUCH simpler in star-schemas. No network traversal and worrying about many-many relationships like in a normalized model.