Data Warehouse – Dimension and Fact for Same Entities Explained

data-warehousedimension

I'm rather new at DW design and am working on a DW to model some IT infrastructure.

The major problem/question at this point is how to model drive information.

We will be collecting aggregate data on files and folders, and separate data on physical drives. Drive info will include total and free space at a minimum and will be updated several times a week.

One of the business questions that will need to be answered is how drive usage is trending over time. The drive information will also be used in a hierarchy leading down to the file/folder level as well.

The options that I can see now are:

  1. Implement DRIVE as a Dimension

    • Simplifies hierarchy design
    • Will this cause issues with reporting? It seems counter-intuitive to me to report time-bound data on a dimension only
    • It also seems problematic to have a dimension that you KNOW will change every time you refresh your data
  2. Implement DRIVE as a Fact Table

    • Simplifies reporting
    • Complicates hierarchy(?) – I'll be using Drive to map data back to a specific server or computer as well. Is it OK to use a fact table as an intermediate level in a hierarchy? I don't think it is.
  3. Implement DRIVE as both a Fact and Dimension

    • Fact will contain just the key, date, and facts on space
    • Dimension will include other non-additive data like what computer it's on, etc.
    • Seems to resolve both issues, but is this an anti-pattern?

Best Answer

I expect I would have a drive_usage fact table with a link to a snapshot time dimension, a drive dimension, a computer dimension and the various numerical facts about the drive at that instant in time.

There should probably be nothing regularly changing in the drive dimension - I guess it depends upon your definition of drive - is it a physical drive or a logical unit or what. Perhaps your "C" drive has a serial number, and it's replaced - then the dimension will expire and a new dimension is added. These things about a dimension are not really "facts", they are attributes. This wouldn't affect reporting because the data for computer X, drive C has continuity. Similarly if computer X is upgraded from dual core to quad core and so there is a change to the dimension (assuming something beyond number of cores is not tracked in a fact table, like a motherboard revision). The capacity of a drive would be in the fact table, so changes to that over time are just new facts with new dates. Sometimes you can even model changes to membership as facts. i.e. if physical drives 1-5 are in logical drive C one day and then physical drives 1-6 are in logical drive C the next, that might just be a fact change in the physical drive membership fact table. These are what some people call factless fact tables, since the only fact is the existence of the row shows membership - there is not much to be done except totalling or counting.

When you get into folders, modeling the hierarchy can be a lot more tricky depending upon what you are trying to achieve with rollups.

There is a lot of art to DW modeling in domains which aren't run-of-the-mill scenarios.