Sql-server – Looking for ideas on maintaining data warehouse in close to real-time – ETL tools, etc

data-warehouseetlsql server

I have done a fair number of traditional data warehouse implementations where the loads were done in a batch-oriented manner, i.e. the data is refreshed nightly or, at most, every few hours. I now face a challenge of creating a system, where the data in the data warehouse needs to be maintained close to real-time (a few minutes delay is OK, but no more than that). I have done a lot of reading and it seems that getting close to real-time has been a trend in DW over the last few years. However, I am having trouble finding specific examples and concrete information about the available tools, which support this kind of "trickle-feed" ETL. It seems that the right tool would be able to read the database transaction log and send those changes over to the data warehouse, while allowing to do some data transformations in-flight. Does anybody has experience with real-time data warehousing and can recommend a good tool or point out a good reading on this subject.

Here are a couple of relevant links:
http://proc.conisar.org/2012/pdf/2214.pdf
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=37&cad=rja&uact=8&ved=0CGQQFjAGOB4&url=http%3A%2F%2Fwww.researchgate.net%2Fpublication%2F226219087_Near_Real_Time_ETL%2Ffile%2F79e4150b23b3aca5aa.pdf&ei=G_dOU8HXI4iy2wW954GoDw&usg=AFQjCNHZVJuxfmuqwjrdi10oxM8v51WcNA&sig2=6OI-xoIz9b0mH_hzITBVrQ

Thank you!

P.S. I work in a Microsoft shop, so the source databases are on Microsoft SQL Server. I do have a good handle on SSIS, but it doesn't seem to fit here.

Best Answer

If you have the Enterprise version of SQL Server, then I would recommend doing this using Change Data Capture (CDC). This is much easier and cleaner than setting up triggers to watch for changed data. Plus it reads the transaction log for changes and then logs those changes in system tables.

You can use SSIS for your ETL. If you have 2012 installed, there are some very handy CDC components available. However you could also use regular components like the OLE DB source and query the built in CDC functions that SQL Server installs when CDC is enabled.

I have used CDC and SSIS to build an integration that is keeping 2 systems in sync while we are transitioning from one to another and it runs every 10 minutes without issue. The key is to keep your SSIS packages as light weight as possible. This means moving lookups away from full cache and other things like that. Ideally, the more often the job runs, the less data you are transmitting each run.

One other thing to consider if you are worried about putting stain on your source system would be look into seeing up an Availability Group read only replica for your source in asynchronous mode. This would allow you to offload your reads from your OLTP database. This does require Enterprise Edition and 2012 or higher though.