Database Design – Hierarchy for Dimensional Model

data-warehousedatabase-design

I am trying to model a DW where I have many levels of geography (Neighborhood, District, City, State).

I have several demographic data that should be included in the model. These data include count of people living in that area, average monthly income, average age, and others. The lowest level I have data for is neighborhood, which means it can be grouped together in order to calculate the values for the upper levels.

The Geography is modeled in a single (denormalized) dimension, where each level gets its own column.

Now I have to fit the demographic data in the model. Should I put it in another, separate dimension, or should I put it in the Geography dimension? What about the aggregated levels?

Putting it all on the Geography dimension would leave me with a large number of columns:

Id
Neighborhood Name
District Name
City Name
State Name
Neighborhood Avg Monthly Income
Neighborhood Avg working population age
Neighborhood Number of people
…
District Avg Monthly Income
District Avg working population age
District Number of people
…
City Avg Monthly Income
City Avg working population age
City Number of people
…

Is this correct? This seems rather convoluted. I searched for alternative designs on several books (including Kimball's), but haven't been able to find anything satisfactory.

Are there better, proven, approved and reliable alternatives to this design?

Best Answer

My take on this is that you may perhaps want to try using a parent-child dimension, so the model will allow for a theoretically infinited amount of levels. That way, you can connect facts to each level of the hierarchy.

Here's an example:

CREATE TABLE dimGeography (
    id
    parent_id
    geographyType
    geographyCode
    geographyName
    PRIMARY KEY (id)
);

CREATE TABLE factDemographics (
    date
    geography_id
    monthlyIncome
    workingPopulationAge
    numberOfPeople
    PRIMARY KEY (date, geography_id)
    FOREIGN KEY (geography_id) REFERENCES dimGeography (id)
);

The advantages of this approach are:

you're free to add or remove levels as you like over time without changing the data model
you can connect facts (demographics) to each level
no duplicate measure columns, i.e. just one column called "number of people"
future-proof: some countries or regions may use different types of levels - for instance, not all countries use "state".

The disadvantages that I can think of are:

you may have to redesign your existing (?) dimension or fact table
your BI tool may not be able to handle parent-child dimensions
more complex SQL queries when aggregating the facts on higher levels

I haven't the slightest idea if this is the "correct" solution according to Kimball, Inmon, etc. Instead, I would urge you to see if it applies to your particular needs. ;)

Related Solutions

Is dimensional modelling useful for modelling tables of calculations

If you could meaningfully slice the data by key1, key2 or key3 then you could make dimensions out of them. In that case you would have separate dimension tables that had all the values of each key, and any relevant attributes (even just a human-readable description of what the value represents) or groupings (roll-ups of the key values).

Each dimension table would have the key value and its attributes, and would link to the fact table on its key.

If your values (Var X) are additive across rows (i.e. all the values added together for Key1='Foo' mean something) then you can use the keys to slice the data.

I don't think that relationships between the Var columns are really relevant to dimensional modelling unless you want to unpivot them and indentify them by a 'variable type' dimension rather than separate columns

Beyond that, the question is a bit vague to really get a clear view of what you are trying to achieve and why you want to make a dimensional model of your data in the first place. Maybe if you can clarify your goals a bit we can give you a more meaningful answer.

EDIT: Dimensional modelling is just identifying the axes by which you want to slice data in aggregate. Sometimes a dimension might be very simple - just a code. Sometimes it may be complex with lots of sub-attributes - such as an insurance policy. A dimension lets you slice data in aggregate, either by its key or by rolling up the data by some attribute of the dimension.

If the VarX columns are additive between rows - i.e it makes sense to aggregate (for example) var2 for two or more rows - then a dimensional structure might make sense for reporting. One key point of a dimensional model is that it tends to facilitate efficient queries in aggregate, and plays nicely with OLAP tools.

So, if I had values 'Foo', 'Bar', 'Wibble', 'Blarg' for key1 and values 'A', 'B', 'C' for key2 then I might have a few rows that looked a bit like:

Key1      Key2       Var1       [. . .]
--------- ---------- ---------- ----------
Foo       A          10         . . .
Bar       A          20         . . .
Foo       B          50         . . .

If I wanted to see the total of Var1 for all Key2='A' then the reporting tool would allow me to slice on Key2 and add up all the rows. If I wanted to see the total of Var1 for Key1='Foo' then I could do something similar. In this instance the total of Var1 for Key2 = 'A' would be 30.

If this is the sort of thing you want to do the for data then a dimensional model could help, and would allow you to put up a statistical reporting facility with an off-the-shelf reporting tool such as Report Builder.

A dimension table for Key1 might look like

Key1    Key1Desc                 Key1Grouping
------- ------------------------ -------------
Foo     Description of 'Foo'     Group A
Bar     Description of 'Bar'     Group A
Wibble  Description of 'Wibble'  Group A
Blarg   Description of 'Blarg'   Group B

By joining this dimension table to the fact table you can then roll up by Key1Grouping and see stats for (say) 'Group A.' You can calculate ratios of Vars after the base additive measures have been aggregated.

Again - I'm still a bit unsure of quite what you want to know.

Database Design for updatable sequential records

Your design seems reasonable to me. While you do have to update all subsequent records when new processes are added or deleted that is easy to accomplish. You just issue an update like:

UPDATE ProcessOrder
SET ProcessOrder = ProcessOrder+1
WHERE ProcessOrder >= [step# where you want to insert]

and then do your insert or delete.

The only other way I can think of would be to design the schema to store the next process id on the row. Something like:

ProcessID | ParentProcessID | NextId
--------------------------------------------------
UUID2     | UUID1           | UUID3
UUID3     | UUID1           | UUID4
UUID4     | UUID1           | NULL

Then if you insert a new step - say between UUID3 and UUID4, you perform more of a linked list operation which will update UUID3|UUID1's NextId to UUID5 and then just insert the new UUID5 with a NextId of UUID4.

This will reduce the UPDATEs to 1 in most cases, but it will make querying the process more difficult as now you have to walk the list from top to bottom to list out step by step.

You need to decide which process you want to favor - inserting and updating or retrieving. If you favor retrieval (which you might if changes are infrequent and reporting is frequent, and the lists are short), then go with your original design. If you favor insert and update (which you might if changes are happening all the time and reporting is infrequent, or lists are really really long), then go with the linked list approach.

I hope this helps. Interested in what other solutions the community might come up with as I'd love to broaden my knowledge around this!

Best Answer

Related Solutions

Is dimensional modelling useful for modelling tables of calculations

Database Design for updatable sequential records

Related Question