Sql-server – non-DDL way to preserve back-references when staging data into production

etlsql serversql-server-2008

Let's say I've got this hypothetical schema:

Source (OLTP) DB:

Table Orders
------------
OrderID int IDENTITY (PK),
CustomerID int NOT NULL,
OrderAmount decimal NOT NULL

Destination (DSS) DB:

Table Activity
--------------
ActivityID int IDENTITY (PK),
PersonID int NOT NULL,
Amount decimal NOT NULL

Table ActivityOrderImport
--------------------
ActivityID int NOT NULL,
SourceOrderID int NOT NULL

Table CustomerMapping
---------------------
CustomerID int NOT NULL,
PersonID int NOT NULL

Obviously the real deal is considerably more complicated with more transformations. But assume for the moment that all this ETL does is merge specific transactions ("orders") from an external entity into a DSS that tracks generic "activities". The links between the external customer and the DSS person are in the CustomerMapping table.

The idea of the "Import" table is to provide some sort of audit trail in case something goes wrong. We don't have a lot of control over the source system and know it to be a bit on the flaky side. So it's really important for us to be able to understand the origin of any given activity.

Right now, there's a script that does this with DDL, which looks something like this:

ALTER TABLE Activity
ADD OrderID int NULL

MERGE Activity
USING #StagingOrders
(...)

INSERT ActivityOrderImport (ActivityID, SourceOrderID)
SELECT a.ActivityID, s.OrderID
FROM #StagingOrders s
INNER JOIN Activity a
    ON a.OrderID = s.OrderID

ALTER TABLE Activity
DROP COLUMN OrderID

This works fine, in the sense of not crashing and burning, it just makes me nauseous every time I look at the DDL.

  • Permanently adding an OrderID column to the Activity table really isn't an option because the data may come from multiple sources, each currently needing their own log table. Adding a separate column to the main table for all of these would break normalization and quickly turn the production database into a dog's breakfast.

  • Removing the IDENTITY constraint from the Activity table and using some natural derived key is a better option, but still impractical for the same reason – because every system has a different way of defining it, we'd end up having to use an nvarchar column for the primary key (ick). Not to mention we'd lose the sequential-ness which is important for a lot of things.

So I've been wondering, is there a better way to do this, one that doesn't involve DDL but still takes into account the heterogeneous nature of the external keys (and thus the need for different tracking tables for each source)?

I'm not necessarily asking for just a script that'll work under the current design. I realize that the design might have to be modified and I'm open to creating more tables or even separate staging databases; the only things I'm not willing to do at this point are (a) add those tracking columns to the base table and/or (b) remove the IDENTITY field. It's important to maintain the separation of concerns here and keep the actual activity data separate from the import "log" data.

Any thoughts as to what I could do with this design to achieve all of the objectives here, or have I painted myself into a corner and made that impossible with the current restrictions?

Best Answer

I would suggest adding a (set of) staging tables to the destination or an intermediary that can be better controlled and more stable. Place the tracking info there. Then do all the transformation from the staging to the final destination, either carrying the tracking info with it or discarding it.

There are several ways you can generate a tracking key from multiple systems, as long as they all follow the same algorithm, it doesn't have to be an INT. It could be a 2 char prefix with a sequential number. For that matter it doesn't have to be just one column.