Sql-server – data warehouse design help

data-warehousedatabase-designsql serversql-server-2012

I'm struggling with the design phase for a data warehouse. It's not the traditional sales example you find on MSDN or all books etc. I've been tasked with creating an education data warehouse, that tracks student progress over time. There is nothing that quite matches it in Kimball's books either. This is what I've come up with so far. I'm following best practices i.e. Creating primary identity keys on each table and including the business key which I may attempt to make an unique key depending on the quality of data (whether referential integrity can be kept)

DimStudent

  • StudentDwKey
  • StudentBizKey

DimSubject

  • SubjectDwKey
  • SubjectBizKey

DimTime or date

  • DateKey

Different format dates, quarters etc. etc.

DimCollege

  • CollegeDwKey
  • CollegeBizKey

DimTeacher or tutor

  • TeacherDwKey
  • TeacherBizKey

FactGrades

  • FK_StudentDwKey
  • FK_SubjectDwKey
  • FK_CollegeDwKey
  • FK_TeacherDwKey
  • Grades

(Obviously there is many more fields in the dimension tables, but just for simplicity)

Student and Subject both have the same business key. Now I'm really struggling with the fact table. The exam grades are stored in the Subject business table for some reason. The ETL I've created splits the grades from the subject table so DimSubject just contains subjects and FactGrades just contains Grades with foreign keys from other tables. The problem I'm having is the relationships and whether or not I need to create multiple fact tables. There are other grades they want analysing that are stored in different places for example i.e. Predicted grades, or admission exam grade. So should I create a fact table for each one, or try and merge them all into one fact table? Also if the relationships are defined by the business key, should I not be creating foreign key relationships between the business keys and not the surrogate keys? So in short.

  • Do I define relationships by newly created keys or business keys.
  • Do I create multiple fact tables for each type of grade if they are in different parts of the business database or put them all in one table.
  • Can I link one dimension to multiple fact tables or does this cause duplication issues.
  • How do I define relationship to date dimension, should this relationship be between the fact table and what do I join on? I have an exam start date and end date but then I will have different dates for other grade types.

Best Answer

Here are my answers:

  1. Define relationships based on your surrogate keys and not your business keys. This is because this is what the SQL engine will use to optimize queries and this is the reason you have surrogate keys. Since surrogate keys are often integers, this simplifies the joins from using different data types such as strings which are less efficient or even using composite keys. It might be worth defining nonclustered indexes on your business keys for lookups during your ETL process, but queries against the data warehouse should utilize your surrogate keys.
  2. Here's how I would approach your fact tables. If the different grades are the same granularity and have most if not all of the same dimensions in common, I would put them in the same fact table. Otherwise, I would put them in separate fact tables. You can always define cubes with multiple measure groups or put a view in place that combines the different fact tables into one that makes sense down the road.
  3. You definitely can link one dimension to multiple fact tables. This is referred to as a conformed dimension and is usually more common that linking a dimension to a single fact table.
  4. Typically, I create a date dimension that is at a single day granularity and link it to any fact table that has a date field in it. However, in an education context, you might have the need for a different granularity such as just assigning a fact to a quarter dimension instead. If that's the case, you could build out a separate dimension table at the desired granularity. However, for your example of an exam, you would just link that exam record to the date dimension based on that specific day. Furthermore, you should build out your date dimension to have fields such as quarter, school year, etc. One other thing to consider in your example is that it might be useful to calculate the duration of the exam based on the start and end dates if they are in a date time format and you want to know how long it took student to complete a specific exam and to see if duration correlates with results in some way.

Let me know if you have any follow up questions!