Sql-server – Use a scalar hashing function in a computed column – non-deterministic

computed-columnhashingsql server

I have a scalar function

select [dbo].[fn_get_hash_fields] ('blah','dbo')

to which is passed a table and a schema
and which returns an expression representing a concatenation of all column values in each row. For example, if the table had the 4 columns below, the function would return (CR's added for readability):

ISNULL(RTRIM([string1]),'')
+ISNULL(RTRIM(CONVERT(varchar(10),[number1])),'')
+ISNULL(RTRIM(CONVERT(varchar(10),[number2])),'')
+ISNULL(RTRIM(CONVERT(varchar(10),number3])),'')

Using this function, I then want to add a computed column that will store a 32 byte hash value representing all the values in each row:

alter table dbo.blah add [checksum] as 
(substring(lower(convert(varchar(32), HASHBYTES('MD5',
[dbo].[fn_get_hash_fields] ('blah','dbo')
),1)),3,32)) 

The column is added, but the hash value is exactly the same for each row.

How do I convince SQLS that the function is non-deterministic and make the function evaluated for each row, computing different hash values?

I know that in the computed column I can simply use the expression returned by the function, rather than a call to the function, and that works. But its not dynamic. If I add a column to the table, it won't be represented in the hash until I drop the computed column and add an updated one that takes the new column into account.

Is there a way around this?
Is there a better approach to store a hash in each row built from all the columnar values in that row?

Here is the function:

CREATE function [dbo].[fn_get_hash_fields] (@p_table_name 
VARCHAR(100),@p_schema_name VARCHAR(20))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @szSqlString as varchar(max)
SET @szSqlString = ''    
      SELECT @szSqlString = @szSqlString +
            CASE DATA_TYPE 
                    WHEN 'int'           THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '])),'''')'
                    WHEN 'tinyint'       THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '])),'''')'
                    WHEN 'smallint'      THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '])),'''')'
                    WHEN 'bigint'        THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '],112)),'''')'  
                    WHEN 'datetime2'     THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '])),'''')'
                    WHEN 'datetime'      THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '],112)),'''')'
                    WHEN 'smalldatetime' THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '],112)),'''')'
                    WHEN 'date'          THEN 'ISNULL(RTRIM(CONVERT(varchar(10),[' + COLUMN_NAME + '],112)),'''')' 
                    WHEN 'bit'           THEN 'ISNULL(RTRIM(CONVERT(varchar(1),[' + COLUMN_NAME + '])),'''')'  
                    WHEN 'decimal'       THEN 'ISNULL(RTRIM(CONVERT(varchar('+ CONVERT(varchar(2),NUMERIC_PRECISION) +'),[' + COLUMN_NAME + '])),'''')' 
                    WHEN 'numeric'       THEN 'ISNULL(RTRIM(CONVERT(varchar('+ CONVERT(varchar(2),NUMERIC_PRECISION) +'),[' + COLUMN_NAME + '])),'''')' 
                    ELSE                      'ISNULL(RTRIM([' + COLUMN_NAME + ']),'''')'
                 END + '+'
            FROM   INFORMATION_SCHEMA.COLUMNS
            WHERE TABLE_SCHEMA = @p_schema_name 
            and TABLE_NAME = @p_table_name                
            AND (COLUMN_NAME != 'HASHKEYID')
            AND [DATA_TYPE] NOT IN ( 'TEXT', 'IMAGE')

    RETURN LEFT(ISNULL(@szSqlString, ''),LEN(@szSqlString)-1)
END  

Best Answer

How do I convince SQL Server that the function is nondeterministic and make the function evaluate for each row, computing different hash values?

SQL Server marks the function nondeterministic, because it accesses data in system tables:

SELECT 
    UserDataAccess = OBJECTPROPERTYEX(FN.ObjectID, 'UserDataAccess'),
    SystemDataAccess = OBJECTPROPERTYEX(FN.ObjectID, 'SystemDataAccess'),
    IsDeterministic = OBJECTPROPERTYEX(FN.ObjectID, 'IsDeterministic')
FROM (VALUES(OBJECT_ID(N'[dbo].[fn_get_hash_fields]', 'FN'))) AS FN (ObjectID);

Result 1

Nevertheless, the function actually does return the same value for each row in the table, because nothing in the function's query depends on anything in the source row. SQL Server does execute the function for each row (something to be avoided, in general for performance reasons), but each execution returns the same result.

The function returns a string, which would need to be executed (with SELECT and FROM <table_name> added in the right places) to return a result. Even then, it would return a value for every row of the table, because there is nothing in the constructed SQL to link the generated text to a particular row.

More to the point, it is not even possible to execute dynamic SQL in the context of a function (unsafe hacks aside).

Is there a way around this? Is there a better approach to store a hash in each row built from all the columnar values in that row?

This isn't as straightforward as one might think at first glance.

For example, given:

CREATE TABLE dbo.Test
(
    c1 integer PRIMARY KEY,
    c2 varchar(10) NOT NULL,
    c3 datetime NOT NULL
);
GO
INSERT dbo.Test 
    (c1, c2, c3)
VALUES 
    (1, 'ABCDE', GETDATE()),
    (2, 'FGHIJ', GETUTCDATE());

The following query returns the broad sort of thing needed:

SELECT 
    T.c1,
    T.c2,
    T.c3, 
    Chk = BINARY_CHECKSUM(*) -- or CHECKSUM
FROM dbo.Test AS T;

Result 2

Unfortunately, the Chk column cannot be added as a computed column in the table:

Msg 1789, Level 16, State 1, Procedure ComputeChecksum, Line 5
Cannot use CHECKSUM(*) in a computed column, constraint, or default definition.

It cannot be used directly in a function either, for a variety of reasons depending on exactly how you try it, but one entertaining result is the CHECKSUM(*) becomes recursive (referencing itself) and execution terminates when the depth limit of 32 is reached.

A trigger implementation of an all-columns checksum/hash is probably possible, but would need to contain dynamic SQL (or some XML trickery) because CHECKSUM(*) would include the checksum/hash column itself. Aside from DML triggers, a DDL trigger would also be needed to populate the values when a new column is added. This might be an interesting exercise, but it is not trivial.

If the restrictions are relaxed so the checksum/hash does not need to be in the same table, more options become available. For example, a view:

CREATE VIEW dbo.TestChk
AS
SELECT T.*, Chk = BINARY_CHECKSUM(*) 
FROM dbo.Test AS T;

Though this looks like it would dynamically adapt to changes in the base table, in fact it does not. You will get wrong results if a column is added to the table, for example. When the table schema is changed, the view would have to be refreshed:

-- New column
ALTER TABLE dbo.Test 
ADD c4 decimal(5,2) NOT NULL DEFAULT 0.0;

-- Refresh required
EXECUTE sys.sp_refreshsqlmodule 
    @name = N'dbo.TestChk', 
    @namespace = DEFAULT;

Note the view also cannot be materialized, because it cannot be schema-bound:

Msg 1054, Level 15, State 7, Procedure TestChk, Line 4
Syntax '*' is not allowed in schema-bound objects.

I haven't tried every possible implementation option (including a SQLCLR function), because even if it can be made to work as you want, it is very likely to be horribly inefficient, and not the 'best' solution to the problem you really need to solve.

We would need to know the reasons for wanting this column, as it is quite likely there is a better way to achieve the end result you need via a different implementation option, possibly a standard feature in the product. I would suggest asking a new question for that.