SQL Server – Maintaining replicated data in a staging-to-warehouse environment

sql-server-2008

I am trying to come up with a way to move metric data from the staging server to the data warehouse servers.

I have a bunch of dimension tables that get populated during the ETL process (on the staging server). These are pushed to all the data warehouse servers (in a scale-out environment) using transactional replication, and this is working fine due to the nature of these tables: since they are dimension tables, they can simply be copied over as-is to all servers.

Now I'm having issues moving the ETL'ed data in the metric (fact) tables from the staging server to the warehouse servers.

This type of data requires the rows from the staging server to be merged (that is, updated, inserted, or deleted) with the existing data on the warehouse server. Once that's done, the data on the staging server that has just been processed needs to be deleted, so the staging server doesn't keep accumulating data and growing huge. However, the data in the data warehouse must be preserved when the data on the staging server is truncated: if data is deleted in the staging environment, it must not be deleted in the warehouse.

The only way I can think of doing this is to move the data from the staging server to a staging table on the data-warehouse server (through replication), then perform the merge on the data-warehouse server into the proper metric warehousing tables. That way the main data warehousing tables aren't directly involved in the replication, so the data in those tables doesn't get deleted when I truncate the metric tables on the staging server.
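To make the idea concrete, a minimal sketch of that warehouse-side merge step, using hypothetical table and column names (`dbo.FactMetrics_Staging` for the replicated staging table and `dbo.FactMetrics` for the main fact table):

```sql
-- Hypothetical names for illustration only.
-- dbo.FactMetrics_Staging: replicated copy of the staging rows on the warehouse server.
-- dbo.FactMetrics: the main fact table, which is NOT part of the replication.
MERGE dbo.FactMetrics AS tgt
USING dbo.FactMetrics_Staging AS src
    ON tgt.MetricKey = src.MetricKey
WHEN MATCHED THEN
    UPDATE SET tgt.MetricValue = src.MetricValue,
               tgt.LoadDate    = src.LoadDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (MetricKey, MetricValue, LoadDate)
    VALUES (src.MetricKey, src.MetricValue, src.LoadDate);
```

Note that a `WHEN NOT MATCHED BY SOURCE THEN DELETE` clause is deliberately omitted here: since the staging table only holds the current incremental batch, that clause would wipe out all historical fact rows. If deletes need to propagate, they would have to be flagged explicitly (e.g. with a delete-marker column in the staging table) rather than inferred from absence.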

It just sounds inefficient! Isn't there a way to get replication to handle all of this (maybe by using clever replication filters)?

Maybe I shouldn't use replication at all, and instead just run custom linked-server queries that shift and truncate the data? Although, from experience, linked-server queries tend not to perform very well.

Any guidance or just general idea dumps would go a long way!

P.S. Sorry if this isn't clear, any questions you have just fire them over

Best Answer

I presume you have substantial data volumes if you have separate staging and data warehouse servers. From your posting, it sounds like your staging server is doing a bit more than just staging the data: you describe merging data into the fact table on the warehouse server, so I presume the staging server is actually running ETL processing as well.

I would not generally recommend a transform-push approach like this; running the ETL processing from the warehouse and pulling from a staging area is generally better. The merge then happens on the warehouse. If you need to scale out, you can replicate the warehouse, and the same operations will happen on the source and the targets.

However, I guess you're stuck with the transform-push architecture, so you've got a couple of options:

  • Use SSIS or another ETL tool to copy the incremental load to a staging table on the warehouse and merge it into the fact tables from there. This will be fairly efficient and can be done with set operations, i.e. a MERGE statement or an INSERT plus an UPDATE.

  • Replicate the staging area into another database on your warehouse server and execute the merge from there after the ETL processing on the staging server has run. This is probably more fiddly than the first option, and I can't really see any advantages to it.

  • If your staging and warehouse databases are located on the same database instance, then you can create synonyms in the warehouse and merge through the synonyms. The synonyms avoid the need to hard-code database references.
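A quick sketch of the synonym option, assuming both databases live on the same instance; the database and table names (`StagingDB`, `Warehouse`, `dbo.FactMetrics_Staging`, `dbo.FactMetrics`) are illustrative:

```sql
-- Illustrative names only. Run in the warehouse database.
-- The synonym hides the cross-database reference, so if the
-- staging database moves or is renamed, only the synonym changes.
CREATE SYNONYM dbo.StagedMetrics
    FOR StagingDB.dbo.FactMetrics_Staging;
GO

-- The merge references the synonym instead of a three-part name:
MERGE dbo.FactMetrics AS tgt
USING dbo.StagedMetrics AS src
    ON tgt.MetricKey = src.MetricKey
WHEN MATCHED THEN
    UPDATE SET tgt.MetricValue = src.MetricValue
WHEN NOT MATCHED BY TARGET THEN
    INSERT (MetricKey, MetricValue)
    VALUES (src.MetricKey, src.MetricValue);
```

None of the procedures or ETL code need to know where the staging database actually lives; repointing the environment is a one-line `DROP SYNONYM` / `CREATE SYNONYM`.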