We are trying to optimize a data warehouse design that will support reporting against data for many time zones. For example, we might have a report for a month's worth of activity (millions of rows) that needs to show activity grouped by the hour of the day. And of course that hour of the day has to be the "local" hour for the given time zone.
We had a design that worked well when we just supported UTC and one local time: the standard design of Date and Time dimensions for UTC and for local time, with their IDs on the fact tables. However, that approach does not seem to scale if we have to support reporting for 100+ time zones.
Our fact tables would get very wide. Also, we'd have to solve the syntax issue in SQL of specifying which date and time IDs to use for grouping on any given run of the report. Perhaps a very large CASE statement?
I've seen some suggestions to get all the data by the UTC time range you are covering, then return it to the presentation layer to convert to local and aggregate there, but limited testing with SSRS suggests that will be extremely slow.
I've consulted some books on the subject as well, and they all seem to say just have UTC and convert on display or have UTC and one local. Would appreciate any thoughts and suggestions.
Note: This question is similar to: Handling time zones in data mart/warehouse, but I cannot comment on that question, so I felt this deserved its own question.
Update: I selected Aaron's answer after he made some significant updates and posted sample code and diagrams. My earlier comments on his answer won't make much sense anymore, as they referred to the original edit of the answer. I'll try to come back and update this again if warranted.
Best Answer
I've solved this by having a very simple calendar table - each year has one row per supported time zone, with the standard offset and the start datetime / end datetime of DST and its offset (if that time zone supports it). Then an inline, schema-bound, table-valued function takes the source time (in UTC, of course) and adds/subtracts the offset.
This will obviously never perform extremely well if you are reporting against a large portion of data; partitioning might seem to help, but you will still have cases where the last few hours in one year or the first few hours in the next year actually belong to a different year when converted to a specific time zone - so you can never get true partition isolation, except when your reporting range does not include December 31 or January 1.
There are a couple of weird edge cases you need to consider:
2014-11-02 05:30 UTC and 2014-11-02 06:30 UTC both convert to 01:30 AM in the Eastern time zone, for example (one for the first time 01:30 was hit locally, and then one for the second time when the clocks rolled back from 2:00 AM to 1:00 AM, and another half hour elapsed). So you need to decide how to handle that hour of reporting - according to UTC, you should see double the traffic or volume of whatever you're measuring once those two hours get mapped to a single hour in a time zone that observes DST. This can also play fun games with sequencing of events, since something that logically had to happen after something else could appear to happen before it once the timing is adjusted to a single hour instead of two. An extreme example is a page view that happened at 05:59 UTC, then a click that happened at 06:00 UTC. In UTC time these happened a minute apart, but when converted to Eastern time, the view happened at 1:59 AM, and the click happened an hour earlier.
2014-03-09 02:30 never happens in the USA. This is because at 2:00 AM we roll the clocks forward to 3:00 AM. So likely you will want to raise an error if the user enters such a time and asks you to convert that to UTC, or design your form so that users can't pick such a time.
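As a side note for readers on modern versions: SQL Server 2016 and later (which postdate this answer) have a built-in AT TIME ZONE operator that makes both edge cases easy to see for yourself. This is only an illustration, not part of the solution below:

```sql
-- Fall back: two distinct UTC instants land on the same local wall-clock time.
-- Eastern fell back at 06:00 UTC on 2014-11-02, so both of these are 01:30 local:
SELECT CONVERT(datetime2, '2014-11-02 05:30') AT TIME ZONE 'UTC'
           AT TIME ZONE 'Eastern Standard Time' AS FirstOneThirty,  -- 01:30 -04:00
       CONVERT(datetime2, '2014-11-02 06:30') AT TIME ZONE 'UTC'
           AT TIME ZONE 'Eastern Standard Time' AS SecondOneThirty; -- 01:30 -05:00

-- Spring forward: 2014-03-09 02:30 never existed on Eastern clocks.
-- AT TIME ZONE assigns an offset rather than raising an error, which is
-- exactly why you may want your own validation on user input:
SELECT CONVERT(datetime2, '2014-03-09 02:30')
           AT TIME ZONE 'Eastern Standard Time' AS NonexistentLocalTime;
```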
Even with those edge cases in mind, I still think you have the right approach: store the data in UTC. Much easier to map data to other time zones from UTC than from some time zone to some other time zone, especially when different time zones start / end DST on different dates, and even the same time zone can switch using different rules in different years (for example the U.S. changed the rules 6 years ago or so).
You will want to use a calendar table for all of this, not some gargantuan CASE expression (not statement). I just wrote a three-part series for MSSQLTips.com on this; I think the 3rd part will be the most useful for you.

A real live example, in the meantime
Let's say you have a very simple fact table. The only fact I care about in this case is the event time, but I'll add a meaningless GUID just to make the table wide enough to care about. Again, to be explicit, the fact table stores events in UTC time and UTC time only. I've even suffixed the column with _UTC so there is no confusion.

Now, let's load our fact table with 10,000,000 rows - representing one row every 3 seconds (1,200 rows per hour) from 2013-12-30 at midnight UTC until sometime after 5 AM UTC on 2014-12-12. This ensures that the data straddles a year boundary, as well as DST transitions forward and back for multiple time zones. This looks really scary, but took ~9 seconds on my system. The table should end up being about 325 MB.
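The original code was not preserved in this copy of the answer; a sketch matching the description (table and column names are assumptions, and the row generator is a standard cross-join trick, not necessarily the one the answer used):

```sql
CREATE TABLE dbo.FactEvents
(
    EventID       bigint IDENTITY(1,1) PRIMARY KEY,
    EventTime_UTC datetime NOT NULL,                         -- facts stored in UTC only
    Filler        uniqueidentifier NOT NULL DEFAULT NEWID()  -- meaningless, just adds width
);

-- 10,000,000 rows, one every 3 seconds, starting 2013-12-30 00:00 UTC:
;WITH n(n) AS
(
    SELECT TOP (10000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
    FROM sys.all_objects AS o1
    CROSS JOIN sys.all_objects AS o2
    CROSS JOIN sys.all_objects AS o3
)
INSERT dbo.FactEvents(EventTime_UTC)
SELECT DATEADD(SECOND, n * 3, '20131230')
FROM n;

-- Support range seeks on the UTC event time:
CREATE INDEX IX_FactEvents_EventTime_UTC ON dbo.FactEvents(EventTime_UTC);
```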
And just to show what a typical seek query will look like against this 10MM row table, if I run this query:
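The query itself was stripped from this copy; something of this shape produces 72 hourly totals for a three-day UTC range (assuming the fact table is named dbo.FactEvents with an index on EventTime_UTC - the actual dates and names in the answer may differ):

```sql
-- Hourly counts for three days, seeking on the indexed UTC column.
-- DATEADD/DATEDIFF against the epoch (0) floors a datetime to the hour.
SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0) AS EventHour_UTC,
       COUNT(*) AS EventCount
FROM dbo.FactEvents
WHERE EventTime_UTC >= '20140108'
  AND EventTime_UTC <  '20140111'   -- open-ended upper bound: 3 days = 72 hours
GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0)
ORDER BY EventHour_UTC;
```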
I get this plan, and it returns in 25 milliseconds*, doing 358 reads, to return 72 hourly totals:
* Duration as measured by the free SentryOne Plan Explorer, which discards results, so this does not include network transfer time of the data, rendering, etc.
It takes a little longer, obviously, if I make my range too large - a month of data takes 258ms, two months takes over 500ms, and so on. Parallelism may kick in:
This is where you start thinking about other, better solutions to satisfy reporting queries, and it has nothing to do with what time zone your output will display. I won't get into that, I just want to demonstrate that time zone conversion is not really going to make your reporting queries suck all that much more, and they may already suck if you are getting large ranges that aren't supported by proper indexes. I'm going to stick to small date ranges to show that the logic is correct, and let you worry about making sure your range-based reporting queries perform adequately, with or without time zone conversions.
Okay, now we need tables to store our time zones (with offsets, in minutes, since not everybody is even hours off UTC) and DST change dates for each supported year. For simplicity, I'm only going to enter a few time zones and a single year to match the data above.
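The DDL is missing from this copy of the answer; an assumed shape, with illustrative zones chosen to cover the variety described below (names, IDs, and which zones appear are my guesses, not the original rows):

```sql
CREATE TABLE dbo.TimeZones
(
    TimeZoneID            int PRIMARY KEY,
    Name                  varchar(64) NOT NULL,
    StandardOffsetMinutes smallint NOT NULL,  -- offset from UTC in minutes, not hours
    ObservesDST           bit NOT NULL
);

INSERT dbo.TimeZones(TimeZoneID, Name, StandardOffsetMinutes, ObservesDST)
VALUES (1, 'UTC',                  0, 0),
       (2, 'US Eastern',        -300, 1),
       (3, 'India',              330, 0),  -- half-hour offset, no DST
       (4, 'Newfoundland',      -210, 1),  -- half-hour offset *with* DST
       (5, 'Australia Eastern',  600, 1);  -- southern hemisphere: DST in the northern winter
```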
Included a few time zones for variety: some with half-hour offsets, some that don't observe DST. Note that Australia, in the southern hemisphere, observes DST during the northern winter, so their clocks go back in April and forward in October. (The above table flips the names, but I'm not sure how to make this any less confusing for southern hemisphere time zones.)
Now, a calendar table to know when TZs change. I'm only going to insert rows of interest (each time zone above, and only DST changes for 2014). For ease of calculations back and forth, I store both the moment in UTC where a time zone changes, and the same moment in local time. For time zones that don't observe DST, it's standard all year long, and DST "starts" on January 1.
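Again the code was stripped; a sketch of the described shape - one row per zone per year, with each transition moment stored in both UTC and local time (column names and the sample row are assumptions):

```sql
CREATE TABLE dbo.TimeZoneCalendar
(
    TimeZoneID       int NOT NULL,
    TheYear          smallint NOT NULL,
    DSTStart_UTC     datetime NOT NULL,  -- January 1 for zones that don't observe DST
    DSTStart_Local   datetime NOT NULL,
    DSTEnd_UTC       datetime NOT NULL,
    DSTEnd_Local     datetime NOT NULL,
    DSTOffsetMinutes smallint NOT NULL,  -- 0 for zones that don't observe DST
    PRIMARY KEY (TimeZoneID, TheYear)
);

-- Example row for US Eastern, 2014: clocks sprang forward at
-- 2014-03-09 07:00 UTC (to 03:00 local) and fell back at
-- 2014-11-02 06:00 UTC (to 01:00 local).
INSERT dbo.TimeZoneCalendar
VALUES (2, 2014, '20140309 07:00', '20140309 03:00',
                 '20141102 06:00', '20141102 01:00', 60);

-- Note: a southern-hemisphere zone has DSTEnd (April) *before* DSTStart
-- (October) within the same calendar year, so any range test against these
-- columns has to handle that inverted case.
```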
You can definitely populate this with algorithms (and the upcoming tip series uses some clever set-based techniques, if I do say so myself) rather than looping or populating manually, what have you. For this answer I decided to just manually populate one year for the five time zones, and I'm not going to bother with any fancy tricks.
Okay, so we have our fact data, and our "dimension" tables (I cringe when I say that), so what's the logic? Well, I presume you're going to have users select their time zone and enter the date range for the query. I will also assume that the date range will be full days in their own time zone; no partial days, never mind partial hours. So they will pass in a start date, an end date, and a TimeZoneID. From there we will use a scalar function to convert the start/end date from that time zone to UTC, which will allow us to filter the data based on the UTC range. Once we've done that, and performed our aggregations on it, we can then apply the conversion of the grouped times back to the source time zone, before displaying to the user.
The scalar UDF:
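The UDF itself is missing from this copy; a hedged reconstruction of one that converts a local time to UTC, assuming the dbo.TimeZones / dbo.TimeZoneCalendar shapes sketched above (names and details are mine, not the answer's):

```sql
CREATE FUNCTION dbo.ConvertToUTC
(
    @local      datetime,
    @TimeZoneID int
)
RETURNS datetime
AS
BEGIN
    DECLARE @offset smallint;

    -- Total offset = standard offset, plus the DST offset if @local falls
    -- inside the DST window for that zone and year.
    -- Caveat: this simple range test assumes DSTStart < DSTEnd within the
    -- year; southern-hemisphere rows need the inverted (OR) test.
    SELECT @offset = tz.StandardOffsetMinutes
         + CASE WHEN @local >= cal.DSTStart_Local
                 AND @local <  cal.DSTEnd_Local
                THEN cal.DSTOffsetMinutes ELSE 0 END
    FROM dbo.TimeZones AS tz
    INNER JOIN dbo.TimeZoneCalendar AS cal
       ON cal.TimeZoneID = tz.TimeZoneID
      AND cal.TheYear = YEAR(@local)
    WHERE tz.TimeZoneID = @TimeZoneID;

    -- Local = UTC + offset, so UTC = local - offset:
    RETURN DATEADD(MINUTE, -@offset, @local);
END
```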
And the table-valued function:
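Also stripped from this copy; a sketch of an inline, schema-bound TVF going the other way (UTC to local), again assuming the table shapes above:

```sql
CREATE FUNCTION dbo.ConvertToLocal
(
    @utc        datetime,
    @TimeZoneID int
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
    -- Add the standard offset, plus the DST offset when @utc falls inside
    -- the zone's DST window (same caveat as above for southern-hemisphere
    -- rows, where the window wraps around the year boundary).
    SELECT LocalTime = DATEADD(MINUTE,
               tz.StandardOffsetMinutes
             + CASE WHEN @utc >= cal.DSTStart_UTC
                     AND @utc <  cal.DSTEnd_UTC
                    THEN cal.DSTOffsetMinutes ELSE 0 END,
               @utc)
    FROM dbo.TimeZones AS tz
    INNER JOIN dbo.TimeZoneCalendar AS cal
       ON cal.TimeZoneID = tz.TimeZoneID
      AND cal.TheYear = YEAR(@utc)
    WHERE tz.TimeZoneID = @TimeZoneID
);
```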
And a procedure that uses it (edit: updated to handle 30-minute offset grouping):
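The procedure body is missing here as well; a sketch of the described flow - convert the local range to UTC for the seek, aggregate in UTC, then convert the grouped times back. It assumes helper functions dbo.ConvertToUTC (local to UTC) and dbo.ConvertToLocal (UTC to local, table-valued) as described in the answer; the 30-minute slot trick is my guess at how the "30-minute offset grouping" edit works:

```sql
CREATE PROCEDURE dbo.HourlyReport
    @StartDate  date,
    @EndDate    date,
    @TimeZoneID int
AS
BEGIN
    SET NOCOUNT ON;

    -- Translate the local, full-day range to UTC once, so the fact scan
    -- is a simple range seek on EventTime_UTC:
    DECLARE @s datetime = dbo.ConvertToUTC(CONVERT(datetime, @StartDate), @TimeZoneID),
            @e datetime = dbo.ConvertToUTC(DATEADD(DAY, 1, CONVERT(datetime, @EndDate)), @TimeZoneID);

    ;WITH slots AS
    (
        -- Aggregate to 30-minute UTC slots, so zones with half-hour offsets
        -- still line up with local hour boundaries:
        SELECT DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, EventTime_UTC) / 30) * 30, 0) AS Slot_UTC,
               COUNT(*) AS EventCount
        FROM dbo.FactEvents
        WHERE EventTime_UTC >= @s AND EventTime_UTC < @e
        GROUP BY DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, EventTime_UTC) / 30) * 30, 0)
    )
    -- Convert each slot to local time (cheap: 48 slots per day, not 10MM
    -- rows), then roll slots up to local hours. Grouping on the offset in
    -- effect as well keeps the repeated fall-back hour as two rows:
    SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, l.LocalTime), 0) AS LocalHour,
           SUM(s.EventCount) AS EventCount
    FROM slots AS s
    CROSS APPLY dbo.ConvertToLocal(s.Slot_UTC, @TimeZoneID) AS l
    GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, l.LocalTime), 0),
             DATEDIFF(MINUTE, s.Slot_UTC, l.LocalTime)
    ORDER BY MIN(s.Slot_UTC);
END
```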
(You may want to have a go at short-circuiting there, or a separate stored procedure, in the event that the user wants reporting in UTC - obviously translating to and from UTC is going to be wasteful busy work.)
Sample call:
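The call was stripped from this copy; a hypothetical call shape, assuming a procedure dbo.HourlyReport(@StartDate, @EndDate, @TimeZoneID) as described, and a TimeZoneID of 2 for US Eastern (both assumptions):

```sql
-- Three local days spanning the 2014 spring-forward night in Eastern time,
-- so the 2 AM local hour should be missing from the output:
EXEC dbo.HourlyReport
    @StartDate  = '20140308',
    @EndDate    = '20140310',
    @TimeZoneID = 2;
```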
Returns in 41ms*, and generates this plan:
* Again, with discarded results.
For 2 months, it returns in 507ms, and the plan is identical other than rowcounts:
While slightly more complex and increasing run time a little bit, I am fairly confident that this type of approach will work out much, much better than the bridge table approach. And this is an off-the-cuff example for a dba.se answer; I'm sure my logic and efficiency could be improved by folks much smarter than me.
You can peruse the data to see the edge cases I talk about - no row of output for the hour where clocks roll forward, two rows for the hour where they rolled back (and that hour happened twice). You can also play with bad values; if you pass in 20140309 02:30 Eastern time, for example, it's not going to work too well.
I might not have all of the assumptions right about how your reporting will work, so you may have to make some adjustments. But I think this covers the basics.