SQL Server – Handling time zones in a data mart/warehouse

data-warehouse, datetime, sql-server-2012, timezone

We are starting to design the building blocks of a data mart/warehouse and we need to be able to support all time zones (our clients are from all over the world). From reading discussions online (and in books), a common solution seems to be to have separate date and time dimensions as well as a timestamp in the fact tables.

However, the question I am having a hard time answering is: what good do the date and time dimensions actually do for me, given my dynamic time zone requirements? A time dimension makes a little more sense, but I'm having a hard time with the date dimension. A typical design for a date dimension includes properties such as day name, day of week, month name, etc. The problem I have with all of that is that 11:00 PM on Tuesday, December 31, 2013 in UTC is already Wednesday, January 1st, 2014 in every time zone at UTC+1 or later.

So if I have to do all these time zone conversions in each and every query (and report), then what is the point of storing these properties that I will probably never use? Some people suggest having fact rows for each time zone, but that seems ridiculous to me given that we need to store millions of records each month.

Others suggest a time zone bridge table, which makes some sense but also adds complexity and extra joins to accomplish something my client apps and reports should easily be able to figure out from a date (reporting will be primarily web-based, where there are myriad libraries to aid in converting, displaying and formatting dates).

The only thing I can think of is the ease, and possibly performance, of grouping by date and hour. But how bad a practice is it to group by datepart (we're using MS SQL, but we will be querying millions of rows)? Or should we consider just extremely simple date and time dimensions with little more than hour, day, month and year numbers, since most literals such as "Monday" wouldn't mean much once time zones come into play?

Best Answer

Firstly...

Separating Date/Time into a Date dimension and a Time dimension is definitely the way to go.

To manage multiple time zones you need to duplicate the DateKey and the TimeKey so that you have the following:

  • LocalDateKey
  • LocalTimeKey
  • UtcDateKey
  • UtcTimeKey
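As a rough illustration, a fact table carrying those four keys might look something like the sketch below. The table name, the measure column and the exact data types are hypothetical; it assumes the usual yyyymmdd / hhmm-style integer smart keys referencing the dbo.Date and dbo.Time dimensions used in the query further down.

create table dbo.FactTableName
(
    FactKey        bigint identity(1,1) not null,

    -- Role-playing date/time keys (e.g. yyyymmdd and hhmm integers)
    LocalDateKey   int not null,   -- references dbo.Date(DateKey)
    LocalTimeKey   int not null,   -- references dbo.Time(TimeKey)
    UtcDateKey     int not null,   -- references dbo.Date(DateKey)
    UtcTimeKey     int not null,   -- references dbo.Time(TimeKey)

    -- Measures and other dimension keys would follow
    SomeMeasure    decimal(18,2) null
);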

You say...

The problem I have with all of that is that 11:00 PM on Tuesday, December 31, 2013 in UTC is already Wednesday, January 1st, 2014 in every time zone at UTC+1 or later.

By having the four columns listed above, you will be able to join the fact table to the Date and/or Time dimension using table aliases (in Kimball terminology these aliased dimension tables are known as "role-playing dimensions"), so you would have something like the following:

/*
    Assumes the following:
        - [DateLongName] has the format of this example "Tuesday, December 31, 2013"
        - [TimeShortName] has the format of this example "11:00 PM"
        - Both [DateLongName] & [TimeShortName] are strings
*/
select
    -- Each label matches this example: "11:00 PM Tuesday, December 31, 2013"
    localTime.TimeShortName + ' ' + localDate.DateLongName AS LocalDateTimeLabel
    ,utcTime.TimeShortName + ' ' + utcDate.DateLongName    AS UtcDateTimeLabel
    ,f.*
from
    FactTableName  AS f

    -- Local Date and Local Time joins          
    inner join dbo.Date  AS localDate
        on localDate.DateKey = f.LocalDateKey

    inner join dbo.Time  AS localTime
        on localTime.TimeKey = f.LocalTimeKey 

    -- Utc Date and Utc Time joins    
    inner join dbo.Date  AS utcDate
        on utcDate.DateKey = f.UtcDateKey

    inner join dbo.Time  AS utcTime
        on utcTime.TimeKey = f.UtcTimeKey 
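Grouping then works against the role-playing dimension's attributes rather than against datepart() on a fact column. A small sketch, assuming the Date dimension carries hypothetical CalendarYear and MonthName attributes and the fact table has a hypothetical SomeMeasure column:

select
    localDate.CalendarYear
    ,localDate.MonthName
    ,sum(f.SomeMeasure) AS TotalMeasure
from
    FactTableName  AS f
    inner join dbo.Date  AS localDate
        on localDate.DateKey = f.LocalDateKey
group by
    localDate.CalendarYear
    ,localDate.MonthName;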

In closing...

As you're building a data mart, and not an OLTP database, the generation of the Local and UTC dates and times should be performed in your ETL, NOT in any client-side applications (apart from localising the UTC time to the report reader's perspective), for the following reasons (a small ETL sketch follows the list):

  • Having the calculation reside in any queries places an extra performance burden on them, multiplied by the number of times you have to run said query for any reports you have (this matters when reading millions of rows)
  • Extra burden of ensuring the calculation is maintained correctly in each query (especially when you take daylight savings time into account)
Performing the conversion in the query prevents range scanning of any indexes the column is part of: calculating on the column forces queries to perform index scans instead of seeks (which are typically more expensive, since every data page must be read); this is known as the predicate being non-sargable.
    • Edit due to comments: This applies if you push the conversion down into the actual query.
There is nothing stopping you from extending the concept of the additional UTC dates and times by calling the column StandardisedDateKey, or CorporateHQDateKey, where instead of a UTC date table you standardise on some other business-agreed baseline.
Having the two separate column types (Local and UTC) allows for side-by-side comparison across geographic distance. Think of someone in Australia entering a record that is timestamped with both Local and UTC; someone in New York then reads the report with the Local (Australia) date and time alongside the New York rendering of the UTC date and time, and can see that something their Australian counterpart did during the middle of the day (Australia time) happened in the middle of the night their time (New York time). This comparison of times is indispensable in multi-national businesses.
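As an example, the key derivation in the ETL might look something like the following sketch. It assumes the source system supplies a datetimeoffset value and that the client's UTC offset has already been looked up; a real implementation would resolve the offset per time zone and daylight-saving rules (e.g. from a calendar of DST transitions), since SQL Server 2012 has no built-in AT TIME ZONE.

declare @EventTime    datetimeoffset(0) = '2013-12-31 23:00:00 +00:00'; -- from the source system
declare @ClientOffset varchar(6)        = '+02:00';                     -- looked up per client/time zone

declare @LocalTime datetimeoffset(0) = switchoffset(@EventTime, @ClientOffset);
declare @UtcTime   datetimeoffset(0) = switchoffset(@EventTime, '+00:00');

select
    convert(int, convert(char(8), convert(date, @LocalTime), 112))   AS LocalDateKey  -- 20140101
    ,datepart(hour, @LocalTime) * 100 + datepart(minute, @LocalTime) AS LocalTimeKey  -- 100
    ,convert(int, convert(char(8), convert(date, @UtcTime), 112))    AS UtcDateKey    -- 20131231
    ,datepart(hour, @UtcTime) * 100 + datepart(minute, @UtcTime)     AS UtcTimeKey;   -- 2300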