SQL Server – Most efficient way to identify records with any number of key columns

database-design, sql-server, sql-server-2014

I work with large third-party data sets. Long experience has taught me that it's a really good idea to give incoming lines of data a surrogate ID as soon as they enter my system, so I can track them easily as they are validated, warehoused, etc. The problem is that the key can potentially be every dimensional value, which could be 200 columns.

My general process is this:

  1. Load the data into a staging table.
  2. Match the data against a second IdMatch table that holds only the key values and surrogate IDs.

    IF OBJECT_ID('Staging.myTest') IS NULL
        CREATE TABLE Staging.myTest (
            [ID] INT IDENTITY(1,1) NOT NULL,
            [Hash] INT NULL,
            [Dim_1] NVARCHAR(32) NOT NULL,
            [Dim_2] NVARCHAR(32) NOT NULL,
            [Dim_3] NVARCHAR(32) NULL,
            [Met_1] INT NULL,
            [Met_2] DECIMAL(5,2) NULL
        );
    
    IF OBJECT_ID('IdMatch.myTest') IS NULL
        CREATE TABLE IdMatch.myTest (
            [ID] INT IDENTITY(1,1) NOT NULL,
            [Hash] INT NULL,
            [Dim_1] NVARCHAR(32) NOT NULL,
            [Dim_2] NVARCHAR(32) NOT NULL,
            [Dim_3] NVARCHAR(32) NULL
        );
    
    TRUNCATE TABLE Staging.myTest;
    TRUNCATE TABLE IdMatch.myTest;
    
    INSERT INTO Staging.myTest
        ([Dim_1], [Dim_2], [Dim_3])
    VALUES ('A', 'A', 'A'),
        ('B', 'B', 'B'),
        ('C', 'C', NULL),
        ('C', 'C', 'C'),
        ('D', 'D', 'D');
    
    INSERT INTO IdMatch.myTest
        ([Dim_1], [Dim_2], [Dim_3])
    VALUES ('A', 'A', 'A');
    
    -- My proc (as a script) for populating the IdMatch table.
    INSERT INTO [IdMatch].myTest
        ([Dim_1], [Dim_2], [Dim_3])
    SELECT src.[Dim_1], src.[Dim_2], src.[Dim_3]
    FROM Staging.myTest AS src
    WHERE NOT EXISTS (
            SELECT 1
            FROM [IdMatch].myTest AS tgt
            WHERE tgt.[Dim_1] = src.[Dim_1]
                AND tgt.[Dim_2] = src.[Dim_2]
                -- NULL never equals NULL, so nullable keys need an explicit check
                AND (tgt.[Dim_3] = src.[Dim_3]
                    OR (tgt.[Dim_3] IS NULL AND src.[Dim_3] IS NULL))
        );
    
    SELECT * FROM IdMatch.myTest;
    

THE PROBLEM:
Matching to get a surrogate ID this way takes a long time when the real data set contains 200+ columns of NVARCHAR data. Is there a better way? I've tried pre-computing hashes, but I'm not sure how to handle the eventual collisions they'll produce.

Best Answer

One common approach is to choose a hash function with a very, very small chance of collision and then assume there won't be any:

CREATE TABLE Staging.myTest (
    [ID] INT IDENTITY(1,1) NOT NULL,
    [Hash] AS 
        CONVERT(binary(32), 
            HASHBYTES('SHA2_256', 
                CONCAT(Dim_1, N'|', Dim_2, N'|', Dim_3))),
    [Dim_1] NVARCHAR(32) NOT NULL,
    [Dim_2] NVARCHAR(32) NOT NULL,
    [Dim_3] NVARCHAR(32) NULL,
    [Met_1] INT NULL,
    [Met_2] DECIMAL(5,2) NULL
);
GO
CREATE TABLE IdMatch.myTest (
    [ID] INT IDENTITY(1,1) NOT NULL,
    [Hash] AS 
        CONVERT(binary(32), 
            HASHBYTES('SHA2_256', 
                CONCAT(Dim_1, N'|', Dim_2, N'|', Dim_3))),
    [Dim_1] NVARCHAR(32) NOT NULL,
    [Dim_2] NVARCHAR(32) NOT NULL,
    [Dim_3] NVARCHAR(32) NULL
);
GO
-- Declared unique because we have decided it will be; a genuine
-- collision will then fail the INSERT instead of silently reusing an ID
CREATE UNIQUE NONCLUSTERED INDEX 
    IX_HASH 
ON IdMatch.myTest 
    ([Hash]);

Note: Most people replace column NULLs with an empty string for hashing, which is the default behaviour of CONCAT. If you need to differentiate NULL from an empty string, you will need to pick some other magic value that cannot occur in the data, and wrap the nullable column in ISNULL or COALESCE.
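
For example, a minimal sketch of that idea as a drop-in replacement for the [Hash] definition above, where N'~NULL~' is an arbitrary sentinel chosen on the assumption that it never appears as a real Dim_3 value:

-- Hypothetical variant of the computed column that distinguishes
-- NULL from empty string. N'~NULL~' is an arbitrary sentinel and
-- must never occur as a genuine Dim_3 value.
[Hash] AS 
    CONVERT(binary(32), 
        HASHBYTES('SHA2_256', 
            CONCAT(Dim_1, N'|', Dim_2, N'|', 
                ISNULL(Dim_3, N'~NULL~')))),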

Adding rows that don't match is then:

INSERT IdMatch.myTest
(
    Dim_1, 
    Dim_2, 
    Dim_3
)
SELECT
    SRC.Dim_1,
    SRC.Dim_2,
    SRC.Dim_3
FROM Staging.myTest AS SRC
WHERE
    NOT EXISTS
    (
        SELECT 1
        FROM IdMatch.myTest AS TGT
        WHERE
            TGT.[Hash] = SRC.[Hash]
    );
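
Once any missing keys have been added, each staged row can pick up its surrogate ID with a join on the hash. A minimal sketch, assuming the table definitions above:

-- Fetch the surrogate ID for every staged row via the hash,
-- which is indexed and declared unique on IdMatch.myTest.
SELECT
    TGT.ID,
    SRC.Dim_1,
    SRC.Dim_2,
    SRC.Dim_3,
    SRC.Met_1,
    SRC.Met_2
FROM Staging.myTest AS SRC
JOIN IdMatch.myTest AS TGT
    ON TGT.[Hash] = SRC.[Hash];

Because the hash column is indexed, this join can be resolved with index seeks instead of comparing 200+ NVARCHAR columns.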

See Finding rows that have changed in T-SQL – CHECKSUM, BINARY_CHECKSUM, HASHBYTES by Greg Low. You should test this approach with your data to see if this scheme will work for you.
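
If you want some reassurance before trusting the no-collision assumption, one hypothetical spot check is to look for distinct key combinations in the staged data that hash to the same value:

-- Any row returned means two different key combinations produced
-- the same hash value in the staged data (a collision).
SELECT D.[Hash]
FROM (
    SELECT DISTINCT [Hash], Dim_1, Dim_2, Dim_3
    FROM Staging.myTest
) AS D
GROUP BY D.[Hash]
HAVING COUNT(*) > 1;

An empty result doesn't prove collisions can never happen, but running it over a representative load is a cheap sanity check. Note also that before SQL Server 2016, HASHBYTES is limited to 8000 bytes of input, which 200+ NVARCHAR columns can exceed; hashing the keys in segments and then hashing the concatenated segment hashes is one possible workaround.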