I work with large third-party data sets. Long experience has taught me that it is a really good idea to give incoming rows of data a surrogate ID as soon as they enter my system, so I can track them easily as they're validated, warehoused, etc. The problem is that the key can potentially span every dimensional value, which could be 200 columns.
My general process is this:
- Load the data into a staging table.
- Match the data against a second IdMatch table that holds only the key values and surrogate IDs.
```sql
IF OBJECT_ID('Staging.myTest') IS NULL
CREATE TABLE Staging.myTest (
    [ID]    INT IDENTITY(1,1) NOT NULL,
    [Hash]  INT NULL,
    [Dim_1] NVARCHAR(32) NOT NULL,
    [Dim_2] NVARCHAR(32) NOT NULL,
    [Dim_3] NVARCHAR(32) NULL,
    [Met_1] INT NULL,
    [Met_2] DECIMAL(5,2) NULL
);

IF OBJECT_ID('IdMatch.myTest') IS NULL
CREATE TABLE IdMatch.myTest (
    [ID]    INT IDENTITY(1,1) NOT NULL,
    [Hash]  INT NULL,
    [Dim_1] NVARCHAR(32) NOT NULL,
    [Dim_2] NVARCHAR(32) NOT NULL,
    [Dim_3] NVARCHAR(32) NULL
);

TRUNCATE TABLE Staging.myTest;
TRUNCATE TABLE IdMatch.myTest;

INSERT INTO Staging.myTest ([Dim_1], [Dim_2], [Dim_3])
VALUES ('A', 'A', 'A'),
       ('B', 'B', 'B'),
       ('C', 'C', NULL),
       ('C', 'C', 'C'),
       ('D', 'D', 'D');

INSERT INTO IdMatch.myTest ([Dim_1], [Dim_2], [Dim_3])
VALUES ('A', 'A', 'A');

-- My proc (as script) for setting the index.
INSERT INTO IdMatch.myTest ([Dim_1], [Dim_2], [Dim_3])
SELECT src.[Dim_1], src.[Dim_2], src.[Dim_3]
FROM Staging.myTest AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM IdMatch.myTest AS tgt
    WHERE tgt.[Dim_1] = src.[Dim_1]
      AND tgt.[Dim_2] = src.[Dim_2]
      AND tgt.[Dim_3] = src.[Dim_3]
);

SELECT * FROM IdMatch.myTest;
```
THE PROBLEM:
Matching to get a surrogate ID this way takes a long time when the real data set contains 200+ columns of NVARCHAR data. Is there a better way? I've tried pre-computing hashes, but I'm not sure how to handle the eventual collisions I'll produce.
Best Answer
One common approach is to choose a hash function with a very, very small chance of collision and then assume there won't be any:
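A minimal sketch of that idea, applied to the sample tables above. It assumes SQL Server 2012 or later (for `CONCAT`); the `[KeyHash]` column name and the `N'|'` separator are my own choices, and since a SHA-256 digest is 32 bytes you would use this new `BINARY(32)` column rather than the existing `[Hash] INT` column:

```sql
-- Persisted computed hash over the key columns. HASHBYTES with SHA2_256
-- makes accidental collisions astronomically unlikely.
-- NULLs hash the same as empty strings (CONCAT's default behaviour).
ALTER TABLE Staging.myTest
    ADD [KeyHash] AS CAST(HASHBYTES('SHA2_256',
        CONCAT([Dim_1], N'|', [Dim_2], N'|', [Dim_3])) AS BINARY(32)) PERSISTED;

ALTER TABLE IdMatch.myTest
    ADD [KeyHash] AS CAST(HASHBYTES('SHA2_256',
        CONCAT([Dim_1], N'|', [Dim_2], N'|', [Dim_3])) AS BINARY(32)) PERSISTED;

-- Index the hash so matching is a narrow seek, not a 200-column compare.
CREATE INDEX IX_IdMatch_myTest_KeyHash ON IdMatch.myTest ([KeyHash]);
```

Note that before SQL Server 2016, `HASHBYTES` input was capped at 8,000 bytes, which can matter when concatenating 200+ NVARCHAR columns.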
Note: Most people replace column NULLs with the empty string for hashing, which is the default behaviour of `CONCAT`. If you need to differentiate NULL from the empty string, you would need to pick some other magic value and wrap each nullable column in `ISNULL` or `COALESCE`.

Adding rows that don't match is then:
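A sketch of that insert, assuming a hypothetical `[KeyHash]` persisted computed column (`HASHBYTES` over the concatenated key columns) exists on both tables:

```sql
-- Match on the single hash column instead of all 200+ key columns.
-- Assumes [KeyHash] is a persisted computed column on both tables, e.g.
--   HASHBYTES('SHA2_256', CONCAT([Dim_1], N'|', [Dim_2], N'|', [Dim_3]))
INSERT INTO IdMatch.myTest ([Dim_1], [Dim_2], [Dim_3])
SELECT src.[Dim_1], src.[Dim_2], src.[Dim_3]
FROM Staging.myTest AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM IdMatch.myTest AS tgt
    WHERE tgt.[KeyHash] = src.[KeyHash]
);
```

With an index on `IdMatch.myTest ([KeyHash])`, each staging row is matched by a single index seek on a 32-byte value rather than a wide multi-column comparison.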
See Finding rows that have changed in T-SQL – CHECKSUM, BINARY_CHECKSUM, HASHBYTES by Greg Low. You should test this approach with your data to see if this scheme will work for you.