If a DW fact table is not uniquely identified by all dimensions, what problems

data-warehousedatabase-designdimensional-modelingfacttable

This is a bit of a thought problem I've been wrestling with. I have a visceral antipathy to the notion of duplicate combinations of dimension values in a fact table. I've read a lot about the existence of problems when the combination of dimensions in a fact table does not form a unique key. However, I want to understand the exact types of analysis failures that can arise.

Note that I am going to stipulate, in advance, that the hypothesized ugly fact table has data that is all at the same grain. Let's say that all sales are reported uniquely, but the finest granularity for the time of the sale is a day. Obviously there will be transactions sharing the same combination of dimension values. So this approach does not summarize the transactions for each day, the way good practice would ordinarily dictate.

I think simple DW queries with standard aggregation would still be correct. By "simple," I mean that only a single fact table is referenced in the query. In the usual form of aggregating/analyzing measures, I think the queries would produce correct results.

One failure case would arise from trying to select unique fact rows by combining all the dimensions. I believe these sorts of queries are practically unknown; I see little use for them, unless the user wants to really drill down to the finest level on all dimensions. Am I correct in thinking this?

The only predictable and common failure case I can see arises from cross-fact queries. Here the extra cardinality would presumably multiply any measures used in any fact table.

With my students (and in my corporate jobs), I often get asked "what bad will happen if I don't follow this rule?" Right now I worry I don't have all the answer.

Thanks in advance for your thoughts and your answers.

Best Answer

I don't think you will run into a lot of problems when querying your data warehouse except for the one you already stated (cross fact queries) although if you are aware of how your design is done and only aggregate over dimensions I think you should still be able to query that. You couldn't join on the dimension keys in the fact table but you could still aggregate.

If you only care about aggregations on a daily granularity there is no need to have a dimension with a finer granularity in my opinion.

Where you might run into problems is if you try to do incremental loads or update fact records in your ETL instead of doing a full load but there are ways around that.

  • You could use a fact table surrogate key as Kimball describes.
  • You could use the technique described by Vincent Rainardi
  • You could include "supporting" columns in your fact table that you never report on but support your ETL strategies (in your case timestamp maybe)