What are the difference between a “dimension” table in a star-schema and a “lookup” table in a relational database

data-warehousedatabase-designdimensional-modelingstar-schema

I'm trying to design a star-schema fact table along with some dimension tables that will surround it. If I reuse the natural key called customer_key in both the fact_table and dim_customer then I don't see a difference between calling something dim and a lookup table. Additionally if I every need to update the customer_name I will loose historical data that represented this fact at the time of the recording. What am I missing when trying to model the dim and fact tables?

I would like to understand difference between relational "lookup" table technique and data warehouse "dimension" table?

fact_table               dim_customer          dim_product
-------------            ------------          -----------
customer_key             customer_key          product_key
product_key              name                  name
units_sold               email                 description
unit_price

Please forgive any ignorance I might show in this question. I am a data warehouse newbie.

Best Answer

If you only have those tables there will be little difference between your datawarehouse star schema and your actual schema.

You probably have a much more complex relational schema though, where you also have customer groups or item types and your schema looks more like

fact table -> customer table -> customer group table
fact table -> item table -> item type table

When you are working towards a star schema you would then denormalize your schema so you include the customer group description in your customer dimension table and your item type description in your item dimension table.

The basis of a star schema is that you have a fact table, and every dimension is a single table just a single step away from the fact table.

The way kimball describes it:

Star schemas characteristically consist of fact tables linked to associated dimension tables via primary/foreign key relationships.

If you don't denormalize your dimensions into single dimension tables you end up with a snowflake schema. Snowflakes may seem like a good idea, but they often result in poorer query performance and can get you in all kinds of trouble later on, especially if you plan to build a cube on top of your data warehouse.

Again Kimball has a post on it:

We generally encourage you to handle many-to-one hierarchical relationships in a single dimension table rather than snowflaking. Snowflakes may appear optimal to an experienced OLTP data modeler, but they’re suboptimal for DW/BI query performance. The linked snowflaked tables create complexity and confusion for users directly exposed to the table structures; even if users are buffered from the tables, snowflaking increases complexity for the optimizer which must link hundreds of tables together to resolve queries. Snowflakes also put burden on the ETL system to manage the keys linking the normalized tables which can become grossly complex when the linked hierarchical relationships are subject to change. While snowflaking may save some space by replacing repeated text strings with codes, the savings are negligible, especially in light of the price paid for the extra ETL burden and query complexity.

With regards to your question about changing the customer data and losing history, in a datawarehouse that's solved by using "slowly changing dimensions".

You basically have several types of slowly changing dimensions, but you are probably looking for "type 2" which largely comes down to storing a valid_from and valid_to date on your dimension row so you can select the row that was valid at the time of the transaction.

Type 2 SCD's

In addition to the primary surrogate key, I recommend adding five additional fields to a dimension that is undergoing Type 2 processing. These fields are shown in Figure 1. The datetimes are full time stamps that represent the span of time between when the change became effective and when the next change becomes effective. The end-effective-datetime of a Type 2 dimension record must be exactly equal to the begin-effective-datetime of the next change for that dimension member.

It's a broad subject to cover but I hope this sets you on the right track for the basics and you can use the correct terminology in the future.

The kimball site is a great resource (and many call it the reference) for dimensional modeling so I suggest you browse the glossary and the design tips section

Best Answer

Related Solutions

What are some ways to implement a many-to-many relationship in a data warehouse

Should I snowflake or duplicate it across the facts

Related Question