SQL Server – Best practices for transferring data from long, wide tables

change-tracking, data-warehouse, sql-server, sql-server-2014

As part of my data warehouse project, I'll be transferring data from our OLTP systems to the data warehouse. Some of the tables are long and wide, so obviously I'll only be transferring the required columns.

To reduce the overhead, I'm considering change tracking on these tables so we only look at changed values rather than scanning the complete table. Unfortunately, our source system runs on SQL Server 2014 Standard Edition, so CDC is an option we cannot take advantage of.

Is there an alternative method that would be better than change tracking?

Best Answer

Have you considered adding a rowversion column to your source tables?

The simplest way to think about rowversion (formerly timestamp in SQL Server) is as a database-wide autonumber. Every INSERT or UPDATE that touches a table containing a rowversion column increments the database-wide counter, and the affected row's rowversion column is stamped with the new value.
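For example, the column can be added to an existing table like this (the table name dbo.Orders and column name RowVer are placeholders for illustration, not anything from your schema):

```sql
-- Hypothetical table and column names; any rowversion column will do.
-- Existing rows are stamped with an initial value when the column is added,
-- so allow time for the ALTER on very large tables.
ALTER TABLE dbo.Orders
    ADD RowVer rowversion;
```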

Here's a brief algorithm for how you could use this (a T-SQL sketch follows the list):

  • Save the database-wide rowversion value (available via @@DBTS; i.e., DataBase TimeStamp)
  • Process your data into your Data Warehouse
  • Next time through, only process rows where the rowversion value is greater than the previously saved @@DBTS
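A minimal sketch of that loop, assuming a RowVer rowversion column on a source table dbo.Orders and a small control table dbo.ETL_Watermark that stores the last captured @@DBTS value per source table (all of these names are illustrative, not part of your schema):

```sql
DECLARE @LastLoaded  binary(8);
DECLARE @CurrentDbts binary(8) = @@DBTS;   -- capture the watermark before extracting

-- 1. Read the watermark saved by the previous load.
SELECT @LastLoaded = LastRowVersion
FROM dbo.ETL_Watermark
WHERE TableName = N'dbo.Orders';

-- 2. Extract only the rows touched since that load (and only the columns the DW needs).
SELECT o.OrderID, o.CustomerID, o.OrderDate, o.Amount
FROM dbo.Orders AS o
WHERE o.RowVer >  @LastLoaded
  AND o.RowVer <= @CurrentDbts;

-- 3. Once the warehouse load has committed, persist the new watermark.
UPDATE dbo.ETL_Watermark
SET LastRowVersion = @CurrentDbts
WHERE TableName = N'dbo.Orders';
```

Capturing @@DBTS up front and bounding the extract with RowVer <= @CurrentDbts keeps the load window consistent: rows modified while the extract runs get a higher rowversion and simply fall into the next run instead of being skipped.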

Note that while you should not miss any rows that should be processed, you may re-process rows with no actual change in data. This is because the rowversion is incremented with every UPDATE statement, even if the data is "UPDATEd" to its current value and not actually changed.
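To illustrate, a statement that rewrites a column with its existing value still produces a new rowversion (again using the hypothetical dbo.Orders table):

```sql
-- A "no-op" update: the data does not change, but the row is still touched,
-- so its RowVer column gets a new, larger value and the row will be picked up
-- again by the next incremental load.
UPDATE dbo.Orders
SET Amount = Amount
WHERE OrderID = 42;
```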

Also, it won't help identify which columns might have been changed. As the name implies, it works strictly at the row level.