Firstly...
Separating Date/Time into a Date dimension and a Time dimension is definitely the way to go.
To manage multiple time zones, you need to duplicate the DateKey and the TimeKey so that you have the following:
- LocalDateKey
- LocalTimeKey
- UtcDateKey
- UtcTimeKey
You say...
The problem I am having with all that is that 11:00 PM on Tuesday,
December 31, 2013 in UTC is Wednesday, January 1st, 2014 in all time
zones that are after UTC+2.
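As a sketch, a fact table carrying all four keys might look like the following (the table name, measure column, and example values are hypothetical, assuming the common yyyymmdd / hhmm surrogate-key convention and a UTC+2 local zone):

```sql
-- Hypothetical fact table with role-playing date/time keys.
-- Example values assume a UTC+2 local zone: an event at
-- 11:00 PM UTC on 2013-12-31 is 01:00 AM local on 2014-01-01.
CREATE TABLE dbo.FactEvent (
    FactEventKey bigint IDENTITY(1,1) PRIMARY KEY,
    LocalDateKey int NOT NULL,  -- e.g. 20140101, FK to dbo.Date
    LocalTimeKey int NOT NULL,  -- e.g. 0100,     FK to dbo.Time
    UtcDateKey   int NOT NULL,  -- e.g. 20131231, FK to dbo.Date
    UtcTimeKey   int NOT NULL,  -- e.g. 2300,     FK to dbo.Time
    EventAmount  decimal(18,2) NOT NULL
);
```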
By having the 4 columns I've listed above, you will be able to join the fact table to the Date and/or Time dimensions using table aliases (in Kimball terminology these aliased dimension tables are known as "Role-Playing Dimensions"), so you would have something like the following:
/*
Assumes the following:
- [DateLongName] has the format of this example "Tuesday, December 31, 2013"
- [TimeShortName] has the format of this example "11:00 PM"
- Both [DateLongName] & [TimeShortName] are strings
*/
select
    -- Returns a string matching this example "11:00 PM Tuesday, December 31, 2013"
    localTime.TimeShortName + ' ' + localDate.DateLongName
    ,utcTime.TimeShortName + ' ' + utcDate.DateLongName
    ,f.*
from FactTableName AS f
    -- Local Date and Local Time joins
    inner join dbo.Date AS localDate
        on localDate.DateKey = f.LocalDateKey
    inner join dbo.Time AS localTime
        on localTime.TimeKey = f.LocalTimeKey
    -- Utc Date and Utc Time joins
    inner join dbo.Date AS utcDate
        on utcDate.DateKey = f.UtcDateKey
    inner join dbo.Time AS utcTime
        on utcTime.TimeKey = f.UtcTimeKey
In closing...
As you're building a data mart, and not an OLTP database, the generation of the Local and UTC times should be performed in your ETL, NOT in any client-side applications (apart from localisation of the UTC time to the report reader's perspective), for the following reasons:
- Having the calculation reside in your queries places an extra performance burden on them, multiplied by the number of times the query runs for your reports (this matters when reading millions of rows)
- Extra burden of ensuring the calculation is maintained correctly in each query (especially when you take daylight saving time into account)
- It prevents range scanning of any indexes the column is part of: performing a calculation on the column forces queries to do index scans instead of seeks (scans are typically more expensive because every data page has to be read); this is known as being non-sargable.
- Edit due to comments: This applies if you push the conversion down into the actual query.
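As a sketch of the ETL-side conversion (assuming SQL Server 2016+ for AT TIME ZONE; the variable names and the time zone are illustrative):

```sql
-- Derive the UTC keys from a recorded local time during ETL
-- (SQL Server 2016+; 'E. Europe Standard Time' is UTC+2 and handles DST)
DECLARE @localEntry datetime2 = '2014-01-01 01:00:00';
DECLARE @utcEntry   datetime2 =
    CAST(@localEntry AT TIME ZONE 'E. Europe Standard Time'
                     AT TIME ZONE 'UTC' AS datetime2);

SELECT
    CONVERT(int, FORMAT(@localEntry, 'yyyyMMdd')) AS LocalDateKey,
    CONVERT(int, FORMAT(@localEntry, 'HHmm'))     AS LocalTimeKey,
    CONVERT(int, FORMAT(@utcEntry,   'yyyyMMdd')) AS UtcDateKey,
    CONVERT(int, FORMAT(@utcEntry,   'HHmm'))     AS UtcTimeKey;
```

Doing this once in the ETL means every downstream query gets both representations as simple, sargable key lookups.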
- Using the concept of having the additional UTC dates and times available, there is nothing stopping you from extending it by calling the key StandardisedDateKey or CorporateHQDateKey, where instead of a UTC date table you standardise on some other business-agreed standard
- Having the two separate column types (Local and UTC) allows for side-by-side comparison across geographic distance. Think of someone in Australia entering a record that is timestamped with both Local and UTC; someone in New York reads the report with the Local (Australia) date and time alongside the New York representation of the UTC date and time, and so sees that something their Australian counterpart did during the middle of the day (Australia time) happened in the middle of the night their time (New York time). This comparison of time is indispensable in multi-national businesses.
I understand one could use something like EXTRACT on CURRENT_TIMESTAMP when selecting rows
You don't need extract in order to get the data:
select *
from your_table
where timestamp_column >= timestamp '2007-08-24 18:34:01';
will happily use an index on timestamp_column and will be very efficient - in fact there will be no difference in performance compared to
select *
from your_table
where milliseconds_column >= 1187906400000;
Now look at the two statements and tell me for which one you immediately understand what the query will return.
If you do need those milliseconds in your application then you can always do something like this:
select some_col,
other_col,
extract(epoch from timestamp_column) * 1000 as millis
from your_table
where timestamp_column >= timestamp '2007-08-24 18:34:01';
And again, I'm almost 100% certain that you won't see any performance difference to:
select some_col,
other_col,
milliseconds_column
from your_table
where milliseconds_column >= 1187906400000;
Plus: if you do store the milliseconds and you want to display the real date, you always need to apply an additional conversion:
select some_col,
other_col,
to_char(timestamp_column, 'yyyy-mm-dd') as the_date
from your_table
where timestamp_column >= timestamp '2007-08-24 18:34:01';
vs.
select some_col,
other_col,
to_char(to_timestamp(milliseconds_column / 1000), 'yyyy-mm-dd') as the_date
from your_table
where milliseconds_column >= 1187906400000;
The second one won't be slower than the first one, but it's more typing and makes the code harder to read.
Edit:
how to create a stored procedure that will convert this timestamp into the milliseconds I want (and the reverse function too)?
You don't need to write your own function:
- convert timestamp to milliseconds:
(extract(epoch from current_timestamp) * 1000)::bigint
- convert milliseconds to timestamp:
to_timestamp(1187980443530 / 1000)
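Putting both together, a round trip could look like this (a PostgreSQL sketch; note that dividing by 1000.0 rather than 1000 preserves the milliseconds, since integer division truncates them):

```sql
-- timestamp -> milliseconds (epoch already includes fractional seconds)
SELECT (extract(epoch from current_timestamp) * 1000)::bigint AS millis;

-- milliseconds -> timestamp (1000.0 keeps the fractional part)
SELECT to_timestamp(1187980443530 / 1000.0) AS ts;
```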
So to wrap it up: go with a timestamp column and don't worry about performance.
Best Answer
This just works:
But note several pitfalls and potential misconceptions here!

The SQL construct BETWEEN ... AND (not a function, strictly speaking; more like an operator in practice) is not concerned with time zones. That would be a misconception of how things work.

Be aware of the two distinct data types timestamp (timestamp without time zone) and timestamptz (timestamp with time zone). See:

In particular, the time zone offset in timestamptz literals only serves as an input / output modifier to the value and is not stored at all. Only the corresponding UTC time is stored internally. See:

Be aware of the odd syntax difference between timestamp literals and time zone specifiers (flipped sign) due to disagreement between the POSIX and SQL standards:
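A quick illustration of the flipped sign (a PostgreSQL sketch; the POSIX-style zone name 'UTC+2' actually means two hours *west* of UTC):

```sql
-- POSIX-style zone names flip the sign: 'UTC+2' is 2 hours behind UTC
SET timezone = 'UTC+2';

-- The SQL-standard literal offset +02 means 2 hours *east* of UTC,
-- so this instant is displayed shifted into the session's -02 zone:
SELECT timestamptz '2018-04-05 06:00:00+02';
```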
The output you show (with different time zone offsets) cannot be achieved by simply returning timestamptz values: timestamptz values are always displayed according to the timezone setting of the session. To get the strings you display (for varying time zones), you would have to preserve the time zone offset of the input and use it to format the output. (Or store the complete input literals as text.)

Finally, note that '2018-04-05 04:00:00 -01' also passes the test, since it's exactly the same timestamptz value as '2018-04-05 06:00:00+01', just formatted differently (same point in time, displayed for different time zones). So three rows pass the test in your example, not just two.

If your head is spinning right now, consider this demo:
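You can verify the equality directly (a PostgreSQL sketch):

```sql
-- Both literals denote 2018-04-05 05:00:00 UTC, so they compare equal
SELECT timestamptz '2018-04-05 04:00:00-01'
     = timestamptz '2018-04-05 06:00:00+01' AS same_instant;
-- same_instant is true
```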
db<>fiddle here