Sql-server – SQL Server 2005 replication for a data warehouse

data-warehousereplicationsql serversql-server-2005sql-server-2008

We are trying to implement some kind of Data Warehouse at my current employer.

Most of our data is in a SQL Server 2005, but we might also integrate some Access databases and some databases in a SQL Server 2008 Runtime version (it might matter because of the functionality it lacks). We would like the data to be replicated around once every hour. I wonder if there are people out that already have worked out a solution for this.

Our users would like to make queries to the warehouse from Excel or other apps using ODBC.

Also our amount of data is around 150 GB.

Best Answer

You have "data warehouse" in quotes, which could mean one of two things: you're either looking for an actual data warehouse or you aren't (yay, truisms!).

If you're looking for an actual data warehouse, there's an entire Microsoft BI stack available to you as part of your already-purchased SQL Server licensing: SQL Server Integration Services for ETL, SQL Server Analysis Services for creating cubes, and SQL Server Reporting Services and PowerPivot for analyzing data. Doing a data warehouse right is not a trivial activity, even for somebody with a great deal of experience designing and developing transactional databases. The big problem is not so much learning the tools and languages (though MDX is so different from T-SQL that it's in its own world), but getting the right mindset. If you do go down this route, start with Ralph Kimball's Data Warehouse Toolkit. It's a fairly long book and was not written for any particular technology stack, but instead tries to give you an understanding of what makes for a successful data warehousing project. Doing it right could possibly take years even if you already know the tools, because the single most important thing about data warehousing is understanding the end user. You really have to know what the end users (the business people) want, how they want it, how they can understand what they're seeing, and how what data and systems you have can get translated into what they want in a single, consistent manner.

If you're trying to do something "on the cheap," there's a much higher likelihood of failure, to be honest. This is why I'm a little concerned about why you put "data warehouse" in quotes. Yeah, there are easy-to-develop non-warehouse models you could come up with (such as possibly replicating all of the tables over to another server), but they're typically going to fall into one (or more) of a few categories:

  • Too complex for end users. End users typically don't understand normalized data and shouldn't be expected to know SQL or the details of a GROUP BY clause. Your best outcome in that scenario is that IT people continue writing the reports against this new system; the more likely outcome is that IT people continue writing the reports against the old systems and everybody forgets that the warehouse exists at all...
  • Inconsistent data. As you go through this, you might find System 1 says that Order 37 includes an order of a dozen ink pens, but System 2 says it was 14 pens. And System 3 had a free-form text field where instead of putting in a number, somebody typed in "dog." If they are all supposed to represent the same thing, which is correct? Finding (and working with the business on creating rules to solve) these data inconsistencies is a key part of building a good warehouse model.
  • Inconsistent grain. If one system has order lines and a second only has orders, there are ways to reconcile this, but if you simply throw the tables out there and leave it to the end user, they may come up with the wrong method of reconciliation (i.e., one which gives incorrect results) without knowing it.

These are among the many problems you'll run into in the world of business intelligence. Good data warehousing (be it Kimball-style, Inmon-style, some hybrid, or something else) doesn't avoid these problems, but instead gives you a mindset where you can understand and solve them. Rolling a cheap-and-easy solution may let you paper over the problems in development, but we don't measure success simply through getting out of development; people have to use it for a warehouse of any type to have value.

Hourly updates isn't necessarily that big of a problem for a good warehouse, as long as you have the hardware necessary to do the job. You can create SSIS packages to run on SQL Agent jobs (or through any other kind of scheduler, honestly) which perform the necessary tasks (pulling data from source systems, doing any transformations, doing any necessary staging, and building Analysis Services cubes), so scheduling isn't a problem if the individual processes are fast enough.