Data Warehouse Design for Reporting Across Time Zones

data-warehousedatabase-designreportingsql servertimezone

We are trying to optimize a data warehouse design that will support reporting against data for many time zones. For example, we might have a report for a month's worth of activity (millions of rows) that needs to show activity grouped by the hour of the day. And of course that hour of the day has to be the "local" hour for the given time zone.

We had a design that worked well when we just supported UTC and one local time. The standard design of Date and Time dimensions for UTC and local time, id's on the Fact tables. However, that approach does not seem to scale if we have to support reporting for 100+ time zones.

Our Fact tables would get very wide. Also, we'd have to solve the syntax issue in SQL of specifying which date and time id's to use for grouping on any given run of the report. Perhaps a very large CASE statement?

I've seen some suggestions to get all the data by the UTC time range you are covering, then return it to the presentation layer to convert to local and aggregate there, but limited testing with SSRS suggests that will be extremely slow.

I've consulted some books on the subject as well, and they all seem to say just have UTC and convert on display or have UTC and one local. Would appreciate any thoughts and suggestions.

Note: This question is similar to: Handling time zones in data mart/warehouse, but I cannot comment on that question, so felt this deserved its own question.

Update: I selected Aaron's answer after he made some significant updates and posted sample code and diagrams. My earlier comments on his answer won't make much sense anymore as they referred to the original edit of the answer. I'll try to come back and update this again if warranted

Best Answer

I've solved this by having a very simple calendar table - each year has one row per supported time zone, with the standard offset and the start datetime / end datetime of DST and its offset (if that time zone supports it). Then an inline, schema-bound, table-valued function that takes the source time (in UTC of course) and adds/subtracts the offset.

This will obviously never perform extremely well if you are reporting against a large portion of data; partitioning might seem to help, but you will still have cases where the last few hours in one year or the first few hours in the next year actually belong to a different year when converted to a specific time zone - so you can never get true partition isolation, except when your reporting range does not include December 31 or January 1.

There are a couple of weird edge cases you need to consider:

2014-11-02 05:30 UTC and 2014-11-02 06:30 UTC both convert to 01:30 AM in the Eastern time zone, for example (one for the first time 01:30 was hit locally, and then one for the second time when the clocks rolled back from 2:00 AM to 1:00 AM, and another half hour elapsed). So you need to decide how to handle that hour of reporting - according to UTC, you should see double the traffic or volume of whatever you're measuring once those two hours get mapped to a single hour in a time zone that observes DST. This can also play fun games with sequencing of events, since something that logically had to happen after something else could appear to happen before it once the timing is adjusted to a single hour instead of two. An extreme example is a page view that happened at 05:59 UTC, then a click that happened at 06:00 UTC. In UTC time these happened a minute apart, but when converted to Eastern time, the view happened at 1:59 AM, and the click happened an hour earlier.
2014-03-09 02:30 never happens in the USA. This is because at 2:00 AM we roll the clocks forward to 3:00 AM. So likely you will want to raise an error if the user enters such a time and asks you to convert that to UTC, or design your form so that users can't pick such a time.

Even with those edge cases in mind, I still think you have the right approach: store the data in UTC. Much easier to map data to other time zones from UTC than from some time zone to some other time zone, especially when different time zones start / end DST on different dates, and even the same time zone can switch using different rules in different years (for example the U.S. changed the rules 6 years ago or so).

You will want to use a calendar table for all of this, not some gargantuan CASE expression (not statement). I just wrote a three-part series for MSSQLTips.com on this; I think the 3rd part will be the most useful for you:

A real live example, in the meantime

Let's say you have a very simple fact table. The only fact I care about in this case is the event time, but I'll add a meaningless GUID just to make the table wide enough to care about. Again, to be explicit, the fact table stores events in UTC time and UTC time only. I've even suffixed the column with _UTC so there is no confusion.

CREATE TABLE dbo.Fact
(
  EventTime_UTC DATETIME NOT NULL,
  Filler UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID()
);
GO

CREATE CLUSTERED INDEX x ON dbo.Fact(EventTime_UTC);
GO

Now, let's load our fact table with 10,000,000 rows - representing every 3 seconds (1,200 rows per hour) from 2013-12-30 at midnight UTC until sometime after 5 AM UTC on 2014-12-12. This ensures that the data straddles a year boundary, as well as DST forward and back for multiple time zones. This looks really scary, but took ~9 seconds on my system. Table should end up being about 325 MB.

;WITH x(c) AS 
(
  SELECT TOP (10000000) DATEADD(SECOND, 
    3*(ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1),
    '20131230')
  FROM sys.all_columns AS s1
  CROSS JOIN sys.all_columns AS s2
  ORDER BY s1.[object_id]
)
INSERT dbo.Fact WITH (TABLOCKX) (EventTime_UTC) 
  SELECT c FROM x;

And just to show what a typical seek query will look like against this 10MM row table, if I run this query:

SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0),
  COUNT(*)
FROM dbo.Fact 
WHERE EventTime_UTC >= '20140308'
AND EventTime_UTC < '20140311'
GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0);

I get this plan, and it returns in 25 milliseconds*, doing 358 reads, to return 72 hourly totals:

_{* Duration as measured by the free SentryOne Plan Explorer, which discards results, so this does not include network transfer time of the data, rendering, etc.}

It takes a little longer, obviously, if I make my range too large - a month of data takes 258ms, two months takes over 500ms, and so on. Parallelism may kick in:

This is where you start thinking about other, better solutions to satisfy reporting queries, and it has nothing to do with what time zone your output will display. I won't get into that, I just want to demonstrate that time zone conversion is not really going to make your reporting queries suck all that much more, and they may already suck if you are getting large ranges that aren't supported by proper indexes. I'm going to stick to small date ranges to show that the logic is correct, and let you worry about making sure your range-based reporting queries perform adequately, with or without time zone conversions.

Okay, now we need tables to store our time zones (with offsets, in minutes, since not everybody is even hours off UTC) and DST change dates for each supported year. For simplicity, I'm only going to enter a few time zones and a single year to match the data above.

CREATE TABLE dbo.TimeZones
(
  TimeZoneID TINYINT    NOT NULL PRIMARY KEY,
  Name       VARCHAR(9) NOT NULL,
  Offset     SMALLINT   NOT NULL, -- minutes
  DSTName    VARCHAR(9) NOT NULL,
  DSTOffset  SMALLINT   NOT NULL  -- minutes
);

Included a few time zones for variety, some with half hour offsets, some that don't observe DST. Note that Australia, in southern hemisphere observes DST during our winter, so their clocks go back in April and forward in October. (The above table flips the names, but I'm not sure how to make this any less confusing for southern hemisphere time zones.)

INSERT dbo.TimeZones VALUES
(1, 'UTC',     0, 'UTC',     0),
(2, 'GMT',     0, 'BST',    60), 
     -- London = UTC in winter, +1 in summer
(3, 'EST',  -300, 'EDT',  -240), 
     -- East coast US (-5 h in winter, -4 in summer)
(4, 'ACDT',  630, 'ACST',  570), 
     -- Adelaide (Australia) +10.5 h Oct - Apr, +9.5 Apr - Oct
(5, 'ACST',  570, 'ACST',  570); 
     -- Darwin (Australia) +9.5 h year round

Now, a calendar table to know when TZs change. I'm only going to insert rows of interest (each time zone above, and only DST changes for 2014). For ease of calculations back and forth, I store both the moment in UTC where a time zone changes, and the same moment in local time. For time zones that don't observe DST, it's standard all year long, and DST "starts" on January 1.

CREATE TABLE dbo.Calendar
(
  TimeZoneID    TINYINT NOT NULL FOREIGN KEY
                REFERENCES dbo.TimeZones(TimeZoneID),
  [Year]        SMALLDATETIME NOT NULL,
  UTCDSTStart   SMALLDATETIME NOT NULL,
  UTCDSTEnd     SMALLDATETIME NOT NULL,
  LocalDSTStart SMALLDATETIME NOT NULL,
  LocalDSTEnd   SMALLDATETIME NOT NULL,
  PRIMARY KEY (TimeZoneID, [Year])
);

You can definitely populate this with algorithms (and the upcoming tip series uses some clever set-based techniques, if I do say so myself), rather than loop, populate manually, what have you. For this answer I decided to just manually populate one year for the five time zones, and I'm not going to bother any fancy tricks.

INSERT dbo.Calendar VALUES
(1, '20140101', '20140101 00:00','20150101 00:00','20140101 00:00','20150101 00:00'),
(2, '20140101', '20140330 01:00','20141026 00:00','20140330 02:00','20141026 01:00'),
(3, '20140101', '20140309 07:00','20141102 06:00','20140309 03:00','20141102 01:00'),
(4, '20140101', '20140405 16:30','20141004 16:30','20140406 03:00','20141005 02:00'),
(5, '20140101', '20140101 00:00','20150101 00:00','20140101 00:00','20150101 00:00');

Okay, so we have our fact data, and our "dimension" tables (I cringe when I say that), so what's the logic? Well, I presume you're going to have users select their time zone and enter the date range for the query. I will also assume that the date range will be full days in their own timezone; no partial days, never mind partial hours. So they will pass in a start date, an end date, and a TimeZoneID. From there we will use a scalar function to convert the start/end date from that time zone to UTC, which will allow us to filter the data based on the UTC range. Once we've done that, and performed our aggregations on it, we can then apply the conversion of the grouped times back to the source time zone, before displaying to the user.

The scalar UDF:

CREATE FUNCTION dbo.ConvertToUTC
(
  @Source   SMALLDATETIME,
  @SourceTZ TINYINT
)
RETURNS SMALLDATETIME
WITH SCHEMABINDING
AS
BEGIN
  RETURN 
  (
    SELECT DATEADD(MINUTE, -CASE 
        WHEN @Source >= src.LocalDSTStart 
         AND @Source < src.LocalDSTEnd THEN t.DSTOffset 
        WHEN @Source >= DATEADD(HOUR,-1,src.LocalDSTStart) 
         AND @Source < src.LocalDSTStart THEN NULL
        ELSE t.Offset END, @Source)
    FROM dbo.Calendar AS src
    INNER JOIN dbo.TimeZones AS t 
    ON src.TimeZoneID = t.TimeZoneID
    WHERE src.TimeZoneID = @SourceTZ 
      AND t.TimeZoneID = @SourceTZ
      AND DATEADD(MINUTE,t.Offset,@Source) >= src.[Year]
      AND DATEADD(MINUTE,t.Offset,@Source) < DATEADD(YEAR, 1, src.[Year])
  );
END
GO

And the table-valued function:

CREATE FUNCTION dbo.ConvertFromUTC
(
  @Source   SMALLDATETIME,
  @SourceTZ TINYINT
)
RETURNS TABLE
WITH SCHEMABINDING
AS
 RETURN 
 (
  SELECT 
     [Target] = DATEADD(MINUTE, CASE 
       WHEN @Source >= trg.UTCDSTStart 
        AND @Source < trg.UTCDSTEnd THEN tz.DSTOffset 
       ELSE tz.Offset END, @Source)
  FROM dbo.Calendar AS trg
  INNER JOIN dbo.TimeZones AS tz
  ON trg.TimeZoneID = tz.TimeZoneID
  WHERE trg.TimeZoneID = @SourceTZ 
  AND tz.TimeZoneID = @SourceTZ
  AND @Source >= trg.[Year] 
  AND @Source < DATEADD(YEAR, 1, trg.[Year])
);

And a procedure that uses it (edit: updated to handle 30-minute offset grouping):

CREATE PROCEDURE dbo.ReportOnDateRange
  @Start      SMALLDATETIME, -- whole dates only please! 
  @End        SMALLDATETIME, -- whole dates only please!
  @TimeZoneID TINYINT
AS 
BEGIN
  SET NOCOUNT ON;

  SELECT @Start = dbo.ConvertToUTC(@Start, @TimeZoneID),
         @End   = dbo.ConvertToUTC(@End,   @TimeZoneID);

  ;WITH x(t,c) AS
  (
    SELECT DATEDIFF(MINUTE, @Start, EventTime_UTC)/60, 
      COUNT(*) 
    FROM dbo.Fact 
    WHERE EventTime_UTC >= @Start
      AND EventTime_UTC <  DATEADD(DAY, 1, @End)
    GROUP BY DATEDIFF(MINUTE, @Start, EventTime_UTC)/60
  )
  SELECT 
    UTC = DATEADD(MINUTE, x.t*60, @Start), 
    [Local] = y.[Target], 
    [RowCount] = x.c 
  FROM x OUTER APPLY 
    dbo.ConvertFromUTC(DATEADD(MINUTE, x.t*60, @Start), @TimeZoneID) AS y
  ORDER BY UTC;
END
GO

(You may want to have a go at short circuiting there, or a separate stored procedure, in the event that the user wants reporting in UTC - obviously translating to and from UTC is going to be wasteful busy work.)

Sample call:

EXEC dbo.ReportOnDateRange 
  @Start      = '20140308', 
  @End        = '20140311', 
  @TimeZoneID = 3;

Returns in 41ms*, and generates this plan:

_{* Again, with discarded results.}

For 2 months, it returns in 507ms, and the plan is identical other than rowcounts:

While slightly more complex and increasing run time a little bit, I am fairly confident that this type of approach will work out much, much better than the bridge table approach. And this is an off-the cuff example for a dba.se answer; I'm sure my logic and efficiency could be improved by folks much smarter than me.

You can peruse the data to see the edge cases I talk about - no row of output for the hour where clocks roll forward, two rows for the hour where they rolled back (and that hour happened twice). You can also play with bad values; if you pass in 20140309 02:30 Eastern time, for example, it's not going to work too well.

I might not have all of the assumptions right about how your reporting will work, so you may have to make some adjustments. But I think this covers the basics.

Firstly...

Separating Datime/Time into a Date dimension and a Time dimension is definitely the way to go.

To manage multiple time zones you need to duplicate the DateKey and the TimeKey so that you have the following:

LocalDateKey
LocalTimeKey
UtcDateKey
UtcTimeKey

You say...

The problem I am having with all that is that 11:00 PM on Tuesday, December 31, 2013 in UTC is Wednesday, January 1st, 2014 in all time zones that are after UTC+2.

By having the 4 columns I've listed above you, will be able to join the fact table to the Date and/or Time dimension Using Table Aliases (in Kimball terminology these aliased dimension tables are known as "Role Playing Dimensions"), so you would have something like the following:

/*
    Assumes the following:
        - [DateLongName] has the format of this example "Tuesday, December 31, 2013"
        - [TimeShortName] has the format of this example "11:00 PM"
        - Both [DateLongName] & [TimeShortName] are strings
*/
select
    -- Returns a string matching this example  "11:00 PM Tuesday, December 31, 2013"
    localTime.TimeShortName + ' ' + localDate.DateLongName
    ,utcTime.TimeShortName + ' ' + utcDate.DateLongName
    ,f.*
from
    FactTableName  AS f

    -- Local Date and Local Time joins          
    inner join dbo.Date  AS localDate
        on localDate.DateKey = f.LocalDateKey

    inner join dbo.Time  AS localTime
        on localTime.TimeKey = f.LocalTimeKey 

    -- Utc Date and Utc Time joins    
    inner join dbo.Date  AS utcDate
        on utcDate.DateKey = f.UtcDateKey

    inner join dbo.Time  AS utcTime
        on utcTime.TimeKey = f.UtcTimeKey

In closing...

As you're building a data mart, and not an OLTP database, the generation of the Local and Utc times should be performed in your ETL, NOT in any client side applications for the following reasons (apart from localisation of the UTC time to the report reader's perspective):

Having the calculation reside in any queries places an extra performance burden on them, multiplied by the number of times you have to run said query for any reports you have (this matters when reading millions of rows)
Extra burden of ensuring the calculation is maintained correctly in each query (especially when you take daylight savings time into account)
Prevent range scanning of any indexes the column is part of, as you'll be performing a calculation on the column which forces queries to perform index scans instead of seeks (which are typically more expensive as each data page is needed to be read); this is known as being non-sargable.
- Edit due to comments: This applies if you push the conversion down into the actual query.
Using the concept of having the additional UTC dates and times available, there is nothing stopping you from taking this concept and extending it by calling this StandardisedDateKey, or CorporateHQDateKey, where instead of a UTC date table you standardise based on some other business agreed standard
Having the two separate column types (Local and UTC), allows for side-by-side comparison across geographic distance. Think -> someone in Australia enters a record that is timestamped with both Local and UTC, someone in New York reads the report with the Local (Australia) date and time and the New York representation of the UTC date and time, thereby seeing that something their Australian counterpart did during the middle of the day (Australia time) happened in the middle of the night their time (New York time). This comparison of time is indispensable in multi-national businesses.

Best Answer

Related Solutions

SQL Server – How to Get Correct Offset Between UTC and Local Times

Sql-server – Handling time zones in data mart/warehouse

Firstly...

You say...

In closing...

Related Question