Should I Link a Fact to a Hierarchical Dimension at ALL Levels or Only the Most Granular?

data-warehouse database-design

Related question about the same project here.

I'm looking at a few approaches to modeling a hierarchical relationship between my fact and dimension tables in a data warehouse for storing IT infrastructure data.

Consider as an example:

  • FACT_Folder
    • Contains facts about a file system folder
  • Dim_Drive
    • A folder links to a single drive
  • Dim_Array
    • A drive links to a single array
  • Dim_Server
    • An array links to a single server
  • Dim_Farm
    • A server links to a single farm

For our purposes we don't need to worry about a folder moving to a different drive.

We want to be able to get metrics at all the levels of this hierarchy.

Assuming I will eventually have millions of folder entities (with daily data) and hundreds or thousands of drives, arrays, and so on, which design pattern would you recommend, and why? (Rough table sketches of each option follow the list.)

  1. Normalized

    • A Folder links to a Drive, and the rest of the hierarchy is between dimensions
  2. Denormalized I

    • A Folder links to a Drive, but the Drive dimension carries the whole hierarchy (array, server, farm) as attributes on every row
  3. Denormalized II

    • A Folder links directly to all levels of the hierarchy
  4. ??????
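For reference, a rough sketch of how the tables might differ under each option. Table and column names are illustrative only; the _Opt suffixes just keep the alternatives side by side:

-- Option 1: Normalized (snowflake). The fact holds only the most
-- granular key; each dimension holds the key of its parent level.
CREATE TABLE FACT_Folder_Opt1 (
    FolderDimID    int,
    DriveDimID     int,
    SnapshotDateID int,
    bytes_on_disk  bigint
);
CREATE TABLE DIM_Drive_Opt1 (
    DimID      int,
    DriveName  varchar(100),
    ArrayDimID int              -- parent link lives in the dimension
);

-- Option 2: Denormalized I. The fact still holds only the Drive key,
-- but the Drive dimension repeats the whole hierarchy as attributes.
CREATE TABLE DIM_Drive_Opt2 (
    DimID      int,
    DriveName  varchar(100),
    ArrayName  varchar(100),
    ServerName varchar(100),
    FarmName   varchar(100)
);

-- Option 3: Denormalized II. The fact links directly to every level.
CREATE TABLE FACT_Folder_Opt3 (
    FolderDimID    int,
    DriveDimID     int,
    ArrayDimID     int,
    ServerDimID    int,
    FarmDimID      int,
    SnapshotDateID int,
    bytes_on_disk  bigint
);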

Best Answer

I would link at all (or most) levels. This denormalized star means that yes, the data is redundant, but it typically makes reporting and analysis a lot easier. Note that this is very different from OLTP normalization: you don't usually have to worry about redundant data getting out of sync, because in a DW scenario the data, once loaded, doesn't change. New facts get added, and dimension rows get expired and new ones created.
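Here's a minimal sketch of that expire-and-create pattern (a Type 2 slowly changing dimension); the RowEffectiveDate/RowExpiryDate/IsCurrent columns, the surrogate key values, and the names are assumptions for illustration:

-- Expire the current row for a drive whose attributes changed...
-- (assumed SCD Type 2 columns on DIM_Drive)
UPDATE DIM_Drive
SET RowExpiryDate = '20120630',
    IsCurrent = 0
WHERE DriveName = 'D:' AND IsCurrent = 1;

-- ...and create a new row under a new surrogate key. Facts loaded
-- from now on point at DimID 43; historical facts keep DimID 42.
INSERT INTO DIM_Drive (DimID, DriveName, ArrayName, RowEffectiveDate, RowExpiryDate, IsCurrent)
VALUES (43, 'D:', 'ARRAY-02', '20120701', '99991231', 1);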

I don't see a Dim_Folder. I would assume that the actual path of the folder would be an attribute of Dim_Folder. Only the numeric measures and any degenerate dimensions (http://en.wikipedia.org/wiki/Degenerate_dimension) belong in the fact table. I wouldn't treat the folder path as a degenerate dimension, because it comes back in every snapshot (and a folder isn't a transaction).
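So a DIM_Folder along those lines might look like this (column names are just a guess; the point is that descriptive attributes live in the dimension, leaving the fact table to keys and measures):

CREATE TABLE DIM_Folder (
    DimID      int PRIMARY KEY,
    FolderPath varchar(400),    -- descriptive attribute lives here,
    FolderName varchar(100)     -- not in the fact table
);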

So you could do something like this:

SELECT AVG(bytes_on_disk)
FROM FACT_Folder
INNER JOIN DIM_Folder
    ON FACT_Folder.FolderDimID = DIM_Folder.DimID
INNER JOIN DIM_Date
    ON FACT_Folder.SnapshotDateID = DIM_Date.DateID
WHERE DIM_Date.Date BETWEEN '20120101' AND '20121231'
    AND DIM_Folder.FolderPath = '/usr/bin/'

Notice how the DIM_Folder predicate keeps the set of dimension IDs small; we're also assuming some kind of index on the fact table over snapshot date and then folder dimension ID (or vice versa).
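Something like this, say (a hypothetical composite index; flip the column order if you filter by folder more often than by date):

-- Supports the date-range filter, then the folder lookup within it
CREATE INDEX IX_FACT_Folder_Date_Folder
    ON FACT_Folder (SnapshotDateID, FolderDimID);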

Notice also that you don't need to join to the folder dimension at all if you just want the data at a higher level. Since you usually know the whole hierarchy at ETL time, the motivation is different from OLTP systems, where you want everything to move together when something changes (leg bone connected to the thigh bone, etc.). In a DW scenario, you really don't want anything to move.

So, bam! Total Farm usage analysis:

SELECT DIM_Farm.Farm_Name, SUM(bytes_on_disk)
FROM FACT_Folder
INNER JOIN DIM_Farm
    ON FACT_Folder.FarmDimID = DIM_Farm.DimID
INNER JOIN DIM_Date
    ON FACT_Folder.SnapshotDateID = DIM_Date.DateID
WHERE DIM_Date.Date BETWEEN '20120101' AND '20121231'
GROUP BY DIM_Farm.Farm_Name
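
And since the fact carries every level's key, drilling down from farm to server is just one more join, not a traversal of the hierarchy (assuming a Server_Name attribute on DIM_Server):

-- 2012 usage per server within each farm
SELECT DIM_Farm.Farm_Name, DIM_Server.Server_Name, SUM(bytes_on_disk)
FROM FACT_Folder
INNER JOIN DIM_Farm
    ON FACT_Folder.FarmDimID = DIM_Farm.DimID
INNER JOIN DIM_Server
    ON FACT_Folder.ServerDimID = DIM_Server.DimID
INNER JOIN DIM_Date
    ON FACT_Folder.SnapshotDateID = DIM_Date.DateID
WHERE DIM_Date.Date BETWEEN '20120101' AND '20121231'
GROUP BY DIM_Farm.Farm_Name, DIM_Server.Server_Name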

Remember, stars are really simple for analysis. You NEVER need to worry about inadvertent cross joins within a single non-snowflaked star; when linking different stars, you DO have to watch out. So queries in MOST cases are MUCH simpler in star schemas: no traversing a network of joins and no many-to-many relationships to worry about, like in a normalized model.