SQL Server – When to Run ETL for Dimensions

data-warehouse etl sql-server

We have a suite of restaurant web applications that store data in SQL Server 2012 on a database server. We have now started to move data for reporting onto a separate data warehouse server. The data needs to reach the data warehouse quite soon after being created or updated.

Knowing when to move transactions to the data warehouse (and into fact tables) is easy — we do that periodically or when certain events occur (like the manager finalising something).

But how do I know when to move the definitions to the data warehouse (and into dimension tables)? We are storing definitions as SCD Type 1.

We’re considering…

  1. On a schedule, comparing all the definitions on the database server against the dimensions and applying any changes, assigning transactions to “Unknown” if the dimension hasn’t arrived yet.
  2. Putting triggers on all the definition tables and updating the data warehouse as the changes happen.
  3. Updating the stored procedures to update the data warehouse every time a change is made.
  4. Using something like Microsoft Sync Framework, and incorporating the transform in there.

The database and the data warehouse are on separate servers, which we are planning to move to SQL Azure.

So, how do I know when to run the ETL for dimensions? And, any suggestions on technology would be helpful (we are currently using SQL Server linked servers).

Best Answer

You have several options, all of which require some scheduling decision on your part. The one with the least impact on your source system would be, as mentioned in the comments, looking for an audit date/time stamp from your OLTP system. Many systems stamp a datetime on rows whenever there is an insert or update, and such a column can be used reliably to capture changes. This is safest if you use a >= pattern, so that you don't miss anything committed while you read; the likely small set of records duplicated from the last millisecond shouldn't cause much of an issue for standard upsert logic.
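As a minimal sketch of that upsert pattern (the dim.MenuItem / stg.MenuItem names and columns are hypothetical, not from the question):

-- SCD Type 1 upsert: re-processing the same source row twice is harmless,
-- so the >= overlap described above is safe
MERGE dim.MenuItem AS tgt
USING stg.MenuItem AS src
    ON tgt.MenuItemBusinessKey = src.MenuItemId
WHEN MATCHED THEN
    UPDATE SET tgt.Name  = src.Name
             , tgt.Price = src.Price
WHEN NOT MATCHED BY TARGET THEN
    INSERT (MenuItemBusinessKey, Name, Price)
    VALUES (src.MenuItemId, src.Name, src.Price);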

You can keep a table that tracks the latest audit datetime per source and use that, though it may become unwieldy if you are pulling from numerous tables for a single dimension. If you know your system, you may also use things like PK increments or a similar dependable data-change pattern to restrict your source query.

-- one row per load; MAX(IncrementalValue*) for a target table is the
-- high-water mark for the next incremental pull
CREATE TABLE [UTIL].[IncrementalLoadHelper]
(
     [Id] INT NOT NULL IDENTITY PRIMARY KEY
    ,[TargetTableSchema] VARCHAR(20) NOT NULL
    ,[TargetTableName] VARCHAR(100) NOT NULL
    ,[IncrementalValueInt] INT NULL
    ,[IncrementalValueDateTime] DATETIME NULL
    ,[IncrementalValueVarchar] VARCHAR(500) NULL
    ,[IncrementalValueDecimal] DECIMAL(18, 4) NULL  -- explicit scale; bare DECIMAL defaults to (18, 0)
    ,[SuccessfulLoadFlag] BIT NOT NULL
    ,[InsertRowCount] INT NULL
    ,[UpdateRowCount] INT NULL
    ,[SSISPackageName] VARCHAR(500) NOT NULL
    ,[AuditInsertDate] DATETIME NOT NULL DEFAULT GETDATE()
    ,[AuditUpdateDate] DATETIME NULL
    ,[AuditUser] VARCHAR(200) NOT NULL DEFAULT SUSER_NAME()
)


-- pull only rows changed since the last successful load; >= avoids missing
-- rows committed while the previous extract was reading
SELECT [Columns]
FROM OLTP.SourceSystem
WHERE AuditDateTime >=
(
    SELECT MAX(IncrementalValueDateTime)
    FROM UTIL.IncrementalLoadHelper
    WHERE TargetTableName = 'TblName'
      AND SuccessfulLoadFlag = 1
)
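After a successful load, record the new high-water mark so the next run starts from there. A sketch; the variable values are placeholders you would capture from the load itself:

DECLARE @MaxAuditDateTime DATETIME = GETDATE()  -- MAX(AuditDateTime) actually loaded
      , @InsertCount INT = 0                    -- rows inserted by the upsert
      , @UpdateCount INT = 0;                   -- rows updated by the upsert

INSERT INTO UTIL.IncrementalLoadHelper
    (TargetTableSchema, TargetTableName, IncrementalValueDateTime
    ,SuccessfulLoadFlag, InsertRowCount, UpdateRowCount, SSISPackageName)
VALUES
    ('dim', 'TblName', @MaxAuditDateTime, 1, @InsertCount, @UpdateCount, 'LoadDimTblName');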

You could also enable Change Tracking, which uses built-in SQL Server functionality and doesn't require the OLTP schema to have been designed with audit columns. It uses fewer resources than Change Data Capture, a similar feature. Either one gives you easily queried tables that identify changes since a given point, letting you greatly reduce the total data that moves during each ETL run.
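A minimal Change Tracking sketch, assuming a hypothetical dbo.MenuItem source table keyed on MenuItemId:

-- enable at the database level, then per table
ALTER DATABASE OLTPDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.MenuItem ENABLE CHANGE_TRACKING;

-- in the ETL: pull only rows changed since the version stored by the last run
DECLARE @LastSyncVersion BIGINT = 0;

SELECT ct.MenuItemId, ct.SYS_CHANGE_OPERATION, s.*
FROM CHANGETABLE(CHANGES dbo.MenuItem, @LastSyncVersion) AS ct
LEFT JOIN dbo.MenuItem AS s ON s.MenuItemId = ct.MenuItemId;  -- NULL s.* means a deleted row

SELECT CHANGE_TRACKING_CURRENT_VERSION();  -- persist this for the next run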

I'd recommend dropping the linked servers and using SSIS to move your data. It will likely be easier to design and maintain, and it has pre-built tasks that work with CDC. Either put SSIS on your DW machine or create a dedicated ETL processing machine, and move as much of the ETL resource demand as you can off the DW and the OLTP servers.

Since you mentioned that facts can arrive first, make sure you have the right inferred-member logic in place so those rows can be resolved later, once all the data is available.
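A sketch of that inferred-member step, reusing the hypothetical names from above (InferredMemberFlag is an assumed column on the dimension):

-- during the fact load: create a placeholder dimension row for any business
-- key that hasn't arrived yet, so the fact can still get a surrogate key
INSERT INTO dim.MenuItem (MenuItemBusinessKey, Name, InferredMemberFlag)
SELECT DISTINCT f.MenuItemId, 'Unknown', 1
FROM stg.FactSales AS f
WHERE NOT EXISTS
(
    SELECT 1
    FROM dim.MenuItem AS d
    WHERE d.MenuItemBusinessKey = f.MenuItemId
);

-- the regular SCD Type 1 dimension load later overwrites the placeholder
-- attributes and clears the flag when the real definition shows up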

Also consider, or make the business consider, the value of faster loads versus the resource and development/maintenance costs. "Real-time data" sounds great until you see its alias: $$$$.