Doing what: Loading data from many flat table CSV files into their final destination, a "Target" table.
Using: TSQL / SQL Server 2008
Goals:
Don't include "duplicate" rows
What are duplicates in this context? Two scenarios:
- For some tables, if two rows have the same value in all columns. ex "*"
- For other tables, if two rows have the same value in
some columns ex. "COL_A, COL_B, COL_C"
I know which tables fall into which scenario, and in scenario 2, which columns are the same.
Current Strategy:
- Load "raw" data in batches (i.e. 1-5.csv, 5-10.csv, etc.) into separate "Staging" databases
- Dynamically create a T-SQL script that transfers from staging to target, removing duplicates both:
  - Within the staged data itself
  - And between the staged data and the target database it's being transferred to
- Execute the transfer T-SQL script (in batches corresponding to each Staging database) to move the data into a single Target database
For each of the two scenarios (i.e. whether a table's duplicates are based on all columns (1) or some columns (2)), I have a different transfer query. Here is a pattern of each, respectively:
INSERT INTO TARGET.TABLE1
SELECT * FROM STAGING.TABLE1
EXCEPT
SELECT * FROM TARGET.TABLE1
-
INSERT INTO TARGET.TABLE2
SELECT * FROM (
    SELECT STAGING.*,
           ROW_NUMBER() OVER (PARTITION BY STAGING.COL_A, STAGING.COL_B, STAGING.COL_C
                              ORDER BY (SELECT NULL)) AS RN
    FROM STAGING.TABLE2 AS STAGING
    LEFT OUTER JOIN TARGET.TABLE2 AS TARGET
        ON (ISNULL(STAGING.COL_A, TARGET.COL_A) IS NULL OR STAGING.COL_A = TARGET.COL_A)
        {... repeat for COL_B and COL_C }
    WHERE TARGET.COL_A IS NULL
) AS T
WHERE RN = 1
I can add unique constraints if that would help.
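For context, one way a unique constraint could help here is SQL Server's IGNORE_DUP_KEY option, which makes inserts silently skip rows that would violate the index instead of failing the whole batch. A sketch for the scenario-2 key (the index name is hypothetical; table and column names are the placeholders from above):

```sql
-- Sketch: enforce scenario-2 uniqueness on TARGET.TABLE2.
-- IGNORE_DUP_KEY = ON makes an INSERT discard duplicate-key rows
-- (with a warning) rather than raising an error for the batch.
-- Note: a unique index treats NULL as equal to NULL, so only one
-- all-NULL key combination is allowed, which matches the NULL-safe
-- comparison intended by the join above.
CREATE UNIQUE NONCLUSTERED INDEX UX_TABLE2_ABC
    ON TARGET.TABLE2 (COL_A, COL_B, COL_C)
    WITH (IGNORE_DUP_KEY = ON);
```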
My questions:
- Is there anything wrong with the overall approach of using staging databases to store "raw" data, then transferring batches from staging into a target while eliminating duplicates both within the stage and between stage and target during the transfer?
I was thinking it might be more efficient to just load all the raw data into its final destination and "de-duplicate" it "in-place" (though I don't know what this would look like).
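For reference, in-place de-duplication typically looks like a ROW_NUMBER over the key columns followed by a DELETE against the CTE. A sketch using the scenario-2 key (names are the placeholders from the question):

```sql
-- Sketch: delete all but one copy of each duplicate group, in place.
-- Rows with RN > 1 are the extra copies; DELETE against the CTE
-- removes them from the underlying table (SQL Server 2005+).
;WITH Dupes AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY COL_A, COL_B, COL_C
               ORDER BY (SELECT NULL)   -- arbitrary choice of survivor
           ) AS RN
    FROM TARGET.TABLE2
)
DELETE FROM Dupes
WHERE RN > 1;
```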
- If the general approach is sound, are the above queries OK? Is EXCEPT a safe operator for scenario 1? Is the ROW_NUMBER PARTITION BY, joined on the same columns, a fast way to avoid transferring duplicates in scenario 2?
Best Answer
This is a standard and typically the best approach. Staging should be a fast, logicless "ETL" step. Get into your OLTP source systems, get your data, and get out! The only reason to spend time on logic in the source query is if it results in a reduced total load on the OLTP system; e.g. incremental-load logic on a 10-million-row table may produce a faster query and shorter locks than a dumb full load of 10 million rows on every execution.
If you don't have resource-contention issues (as with a flat file), you can still maintain this approach, keeping all of your Transform operations in one database or ETL tool for efficiency, maintainability, and consolidation of ETL logic. Nothing trips up another dev like a key piece of logic hidden in the staging steps while the majority of the data transformation resides elsewhere.
I personally love Window Functions as they are performant with clean, concise code.
EXCEPT is also, typically, a very efficient operator, and you are likely safe using it. There's nothing immediately wrong with your approach, so I'd save optimization until it actually becomes an issue. Since you used the word "safe", just double-check the rules and how they apply to your data; you may get unexpected results if you don't account for how EXCEPT operates.
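To make that caveat concrete: unlike a = comparison, EXCEPT (like all set operators) treats two NULLs as equal when comparing rows, and it also implicitly applies DISTINCT to its output, so within-batch duplicates collapse to a single row. A quick illustration with hypothetical values:

```sql
-- Both sides contain the row (1, NULL). EXCEPT compares NULL as
-- equal to NULL, so the row is eliminated and nothing is returned,
-- whereas a WHERE COL_B = NULL predicate would never match.
SELECT 1 AS COL_A, CAST(NULL AS INT) AS COL_B
EXCEPT
SELECT 1, NULL;
```

For all-column duplicate detection this NULL handling is usually exactly what you want, but it is worth confirming against your data.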