SQL Server – How to Handle Uniqueness of Many Large Columns

Tags: sql-server, sql-server-2014, unique-constraint

I have the following table:

CREATE TABLE dbo.Document
(
    DocumentId int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Document PRIMARY KEY CLUSTERED,
    [Timestamp] datetime2(7) NOT NULL CONSTRAINT DF_Document_Timestamp  DEFAULT (getdate()),
    CreatedBy nvarchar(128) NOT NULL CONSTRAINT DF_Document_CreatedBy  DEFAULT (dbo.getCurrentUser()),
    MonthId int NOT NULL,
    TimeModeId int NOT NULL CONSTRAINT FK_Document_TimeMode REFERENCES usr.TimeMode,
    Key1 bit NOT NULL,
    Key2 int NULL,
    Key3 varchar(max) NULL,   -- sometimes above 8000 chars
    Key4 varchar(max) NULL,   -- sometimes above 8000 chars
    Key5 varchar(max) NULL,   -- sometimes above 8000 chars
    Key6 varchar(max) NULL,   -- sometimes above 8000 chars
    Key7 varchar(max) NULL,   -- sometimes above 8000 chars
    Key8 int NOT NULL,
    CONSTRAINT FK_Document_BrandType FOREIGN KEY(Key8) REFERENCES dbo.BrandType (Key8)
)

Although I pushed hard for a better natural identifier, I had to deal with the following natural unique tuple:

MonthId, TimeModeId, Key1, ... , Key8

This is far too wide for a UNIQUE index (the key size limit is 900 bytes in SQL Server 2014 and earlier, and varchar(max) columns cannot be index key columns at all), so I had to come up with something. My idea is to compute a hash over these columns, so I added a PERSISTED computed column as follows:

FiltersHash  AS (HASHBYTES('SHA2_256',
        CONVERT(varchar(10), MonthId) + '|'
        + CONVERT(varchar(4), TimeModeId) + '|'
        + CONVERT(varchar(4), Key1) + '|'
        + ISNULL(CONVERT(varchar(max), Key2), '') + '|'
        + ISNULL(CONVERT(varchar(max), Key3), '') + '|'
        + ISNULL(CONVERT(varchar(max), Key4), '') + '|'
        + ISNULL(CONVERT(varchar(max), Key5), '') + '|'
        + ISNULL(CONVERT(varchar(max), Key6), '') + '|'
        + ISNULL(CONVERT(varchar(max), Key7), '') + '|'
        + ISNULL(CONVERT(varchar(4), Key8), '')
    )) PERSISTED,
CONSTRAINT UQ_Document_FiltersHash UNIQUE NONCLUSTERED (FiltersHash),

It has already proved useful: through a convoluted scenario, the application tried to insert a duplicate document, and the constraint caught it.
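For example, with purely hypothetical values, a second identical insert is now rejected:

-- hypothetical values for illustration only; Key3..Key7 default to NULL
INSERT INTO dbo.Document (MonthId, TimeModeId, Key1, Key2, Key8)
VALUES (201401, 1, 0, 42, 7);

INSERT INTO dbo.Document (MonthId, TimeModeId, Key1, Key2, Key8)
VALUES (201401, 1, 0, 42, 7);
-- Msg 2627: Violation of UNIQUE KEY constraint 'UQ_Document_FiltersHash'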

Question: is this a good solution, or is there a simpler or more efficient way to enforce uniqueness over such a wide key?

Note: In my application, I can safely ignore collisions (if one ever happens, the consequences are rather small). Thanks to Aaron Bertrand for pointing this out.

Best Answer

The chance of a hash collision is astronomically small (as discussed elsewhere on Stack Exchange: https://stackoverflow.com/a/4014407); by the birthday approximation, the odds that any two of n random SHA-256 hashes collide are roughly n^2 / 2^257. However, you can reduce it further by adding a second key:

...
FiltersHash  AS HASHBYTES('SHA2_256', /* Various fields */) PERSISTED,
-- ISNULL keeps a single NULL column from nulling out the whole expression
KeyPrefixes  AS CAST(Key1 AS CHAR(1)) + '|' + ISNULL(CAST(Key2 AS VARCHAR(10)), '')
    + '|' + ISNULL(LEFT(Key3, 100), '') + '|' + ISNULL(LEFT(Key4, 100), '')
    + '|' + ISNULL(LEFT(Key5, 100), '') + '|' + ISNULL(LEFT(Key6, 100), '')
    + '|' + ISNULL(LEFT(Key7, 100), '') + '|' + CAST(Key8 AS VARCHAR(10)) PERSISTED
...
CREATE UNIQUE INDEX UQ_Docs_BizKey ON dbo.Document (FiltersHash, KeyPrefixes)

Now two records must match on three full fields and the first 100 characters of five more. If your data commonly includes vertical pipes, consider an alternative separator, as sketched below. INSERTs into the table will be a little slower, but at your data volumes that's probably not a concern.
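For instance, a sketch of the same computed column using CHAR(31) (the ASCII unit separator) as the delimiter, on the assumption that it never appears in your key text:

-- hypothetical variant: CHAR(31) instead of '|' as the separator
KeyPrefixes  AS CAST(Key1 AS CHAR(1)) + CHAR(31) + ISNULL(CAST(Key2 AS VARCHAR(10)), '')
    + CHAR(31) + ISNULL(LEFT(Key3, 100), '') + CHAR(31) + ISNULL(LEFT(Key4, 100), '')
    + CHAR(31) + ISNULL(LEFT(Key5, 100), '') + CHAR(31) + ISNULL(LEFT(Key6, 100), '')
    + CHAR(31) + ISNULL(LEFT(Key7, 100), '') + CHAR(31) + CAST(Key8 AS VARCHAR(10)) PERSISTED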

As an aside, you're probably better off clustering on MonthId and making DocumentId a nonclustered PK (see the sketch below), assuming people occasionally run searches by date range ("all documents from June") but rarely search for a range of document IDs.
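A minimal sketch of that arrangement (the index name is hypothetical):

-- nonclustered PK, with the clustered index on MonthId instead
-- (non-unique, so SQL Server adds a hidden uniquifier)
CREATE TABLE dbo.Document
(
    DocumentId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Document PRIMARY KEY NONCLUSTERED,
    MonthId int NOT NULL,
    -- ... remaining columns as above ...
);

CREATE CLUSTERED INDEX CX_Document_MonthId ON dbo.Document (MonthId);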