Database Design – Hierarchy for Dimensional Model

data-warehousedatabase-design

I am trying to model a DW where I have many levels of geography (Neighborhood, District, City, State).

I have several demographic data that should be included in the model. These data include count of people living in that area, average monthly income, average age, and others. The lowest level I have data for is neighborhood, which means it can be grouped together in order to calculate the values for the upper levels.

The Geography is modeled in a single (denormalized) dimension, where each level gets its own column.

Now I have to fit the demographic data in the model. Should I put it in another, separate dimension, or should I put it in the Geography dimension? What about the aggregated levels?

Putting it all on the Geography dimension would leave me with a large number of columns:

  • Id
  • Neighborhood Name
  • District Name
  • City Name
  • State Name
  • Neighborhood Avg Monthly Income
  • Neighborhood Avg working population age
  • Neighborhood Number of people
  • District Avg Monthly Income
  • District Avg working population age
  • District Number of people
  • City Avg Monthly Income
  • City Avg working population age
  • City Number of people

Is this correct? This seems rather convoluted. I searched for alternative designs on several books (including Kimball's), but haven't been able to find anything satisfactory.

Are there better, proven, approved and reliable alternatives to this design?

Best Answer

My take on this is that you may perhaps want to try using a parent-child dimension, so the model will allow for a theoretically infinited amount of levels. That way, you can connect facts to each level of the hierarchy.

Here's an example:

CREATE TABLE dimGeography (
    id
    parent_id
    geographyType
    geographyCode
    geographyName
    PRIMARY KEY (id)
);

CREATE TABLE factDemographics (
    date
    geography_id
    monthlyIncome
    workingPopulationAge
    numberOfPeople
    PRIMARY KEY (date, geography_id)
    FOREIGN KEY (geography_id) REFERENCES dimGeography (id)
);

The advantages of this approach are:

  • you're free to add or remove levels as you like over time without changing the data model
  • you can connect facts (demographics) to each level
  • no duplicate measure columns, i.e. just one column called "number of people"
  • future-proof: some countries or regions may use different types of levels - for instance, not all countries use "state".

The disadvantages that I can think of are:

  • you may have to redesign your existing (?) dimension or fact table
  • your BI tool may not be able to handle parent-child dimensions
  • more complex SQL queries when aggregating the facts on higher levels

I haven't the slightest idea if this is the "correct" solution according to Kimball, Inmon, etc. Instead, I would urge you to see if it applies to your particular needs. ;)