Should I create a Factless Fact Table or Denormalise the table

database-designdimensional-modeling

I am creating a data warehouse from a store database and I have a question regarding the design of my dimensions and facts.

In the store database a table exists for Person, Person_Address and Person_Address_Type. These are linked by another tabled name Entity_Address_ID which links the three tables by their primary keys to give details on what a person's address is and what type of address it is.

My question is, should I create a dimension for all three tables, and a fact-less fact table to link them together or should I de-normalise my dimensions and add to each dimension a foreign key for the address and address type they are linked too?

Here is a very quick UML of what the current database looks like to provide clarification
UML of Problem

Best Answer

Merge them. Sometimes it's worthwhile to have a separate Addresses table, but generally it's simplest to make those fields of a single Customers dimension. Because addresses are nearly unique, you're not going to save many records by hiving them out to their own table.

If your Customers table is really large, and addresses change relatively often then moving Addresses out as a Type-4 SCD could be a worthwhile optimization, but a small one.

If there's interest in analyzing customers independently of their specific sales (e.g., "20% of our top-100 customers are international"), you can absolutely add a NumCustomers dummy column to this Customers table to turn it into a factless fact table.

Related Solutions

Should I Link a Fact to Hierarchical Dimension at ALL Levels or only the most granular

I would link at all/most levels. This denormalized star means that yes, the data is redundant, but it typically makes the reporting and analysis a lot easier. Note that this is very different from OLTP normalization, and you don't typically have to worry about redundant data getting out of sync because in a DW scenario data never changes. New facts get added and dimensions get expired and new ones created.

I don't see a Dim_Folder. I would assume that the actual path of the folder would be an attribute of the Dim_Folder. Only the numeric quantity and any degenerate dimensions (http://en.wikipedia.org/wiki/Degenerate_dimension) would be in the fact table. I wouldn't think of the folder path as a degenerate dimension because it keeps coming back in each snapshot (an a folder isn't a transaction).

So you could do something like this:

SELECT AVG(bytes_on_disk)
FROM FACT_Folder
INNER JOIN DIM_Folder
    ON FACT_Folder.FolderDimID = DIM_Folder.DimID
INNER JOIN DIM_Date
    ON FACT_Folder.SnapshotDateID = DIM_Date.DateID
WHERE DIM_Date.Date BETWEEN '20120101' AND '20121231'
    AND DIM_Folder.FolderPath = '/usr/bin/'

See how the DIM_Folder usage makes the set of dim ids small and then, we're assuming some kind of index on snapshot date and then folder dim id (or vice versa).

See how you also now don't need to join on folder at all if you just want the data at a higher level. Since you usually know all this at ETL time, there is a different motivation than in OLTP systems where you want everything to move together when something is changed (leg bone connected to the thigh bone, etc.). In DW scenario, you really don't want anything to move.

So, bam! - total Farm usage analysis:

SELECT DIM_Farm.Farm_Name, SUM(bytes_on_disk)
FROM FACT_Folder
INNER JOIN DIM_Farm
    ON FACT_Folder.FarmDimID = DIM_Farm.DimID
INNER JOIN DIM_Date
    ON FACT_Folder.SnapshotDateID = DIM_Date.DateID
WHERE DIM_Date.Date BETWEEN '20120101' AND '20121231'
GROUP BY DIM_Farm.Farm_Name

Remember stars are really simple for analysis. You NEVER need to worry about inadvertent cross joins in a single non-snowflaked star. When linking different stars, you DO have to watch out. So queries in MOST cases are MUCH simpler in star-schemas. No network traversal and worrying about many-many relationships like in a normalized model.

Sql-server – Dealing with measure groups of different granularity

If you want to see measures from two different fact tables side by side but different granularity, it's possible but not perfect. For example, the measure for purchase amount in FactPurchase has no monthly equivalent.

To do this, it helps to have as many related dimensions between both fact tables. So if you have a dimension such as property type that is related to FactPurchase, I would make that a dimension on FactMonthlyStatus as well. This can be done by adding these to the actual FactMonthlyStatus table and figuring them out in your ETL process or by creating a view that sits on top of FactMonthlyStatus that does a join to lookup these dimensions as they are loaded into your cube. However, you are still going to have dimensions such as month that are at a lower level of granularity than FactPurchase.

The main thing to figure out is how your measures in both fact tables relate to each other. Are you attempting to keep a running total for rent paid and see how it relates to what you paid for the property? If that's the case, it might make sense to create a cube based off a single view that combined the FactPurchase with an aggregate of FactMonthlyStatus to see how they compare.

Best Answer

Related Solutions

Should I Link a Fact to Hierarchical Dimension at ALL Levels or only the most granular

Sql-server – Dealing with measure groups of different granularity

Related Question