Let's say I've got this hypothetical schema:
Source (OLTP) DB:
Table Orders
------------
OrderID int IDENTITY (PK),
CustomerID int NOT NULL,
OrderAmount decimal NOT NULL
Destination (DSS) DB:
Table Activity
--------------
ActivityID int IDENTITY (PK),
PersonID int NOT NULL,
Amount decimal NOT NULL
Table ActivityOrderImport
--------------------
ActivityID int NOT NULL,
SourceOrderID int NOT NULL
Table CustomerMapping
---------------------
CustomerID int NOT NULL,
PersonID int NOT NULL
Obviously the real deal is considerably more complicated with more transformations. But assume for the moment that all this ETL does is merge specific transactions ("orders") from an external entity into a DSS that tracks generic "activities". The links between the external customer and the DSS person are in the CustomerMapping table.
The idea of the "Import" table is to provide some sort of audit trail in case something goes wrong. We don't have a lot of control over the source system and know it to be a bit on the flaky side. So it's really important for us to be able to understand the origin of any given activity.
Right now, there's a script that does this with DDL, which looks something like this:
ALTER TABLE Activity
ADD OrderID int NULL
MERGE Activity
USING #StagingOrders
(...)
INSERT ActivityOrderImport (ActivityID, SourceOrderID)
SELECT a.ActivityID, s.OrderID
FROM #StagingOrders s
INNER JOIN Activity a
ON a.OrderID = s.OrderID
ALTER TABLE Activity
DROP COLUMN OrderID
This works fine, in the sense of not crashing and burning, it just makes me nauseous every time I look at the DDL.
-
Permanently adding an
OrderID
column to theActivity
table really isn't an option because the data may come from multiple sources, each currently needing their own log table. Adding a separate column to the main table for all of these would break normalization and quickly turn the production database into a dog's breakfast. -
Removing the
IDENTITY
constraint from theActivity
table and using some natural derived key is a better option, but still impractical for the same reason – because every system has a different way of defining it, we'd end up having to use annvarchar
column for the primary key (ick). Not to mention we'd lose the sequential-ness which is important for a lot of things.
So I've been wondering, is there a better way to do this, one that doesn't involve DDL but still takes into account the heterogeneous nature of the external keys (and thus the need for different tracking tables for each source)?
I'm not necessarily asking for just a script that'll work under the current design. I realize that the design might have to be modified and I'm open to creating more tables or even separate staging databases; the only things I'm not willing to do at this point are (a) add those tracking columns to the base table and/or (b) remove the IDENTITY
field. It's important to maintain the separation of concerns here and keep the actual activity data separate from the import "log" data.
Any thoughts as to what I could do with this design to achieve all of the objectives here, or have I painted myself into a corner and made that impossible with the current restrictions?
Best Answer
I would suggest adding a (set of) staging tables to the destination or an intermediary that can be better controlled and more stable. Place the tracking info there. Then do all the transformation from the staging to the final destination, either carrying the tracking info with it or discarding it.
There are several ways you can generate a tracking key from multiple systems, as long as they all follow the same algorithm, it doesn't have to be an INT. It could be a 2 char prefix with a sequential number. For that matter it doesn't have to be just one column.