SQL Server – Choosing the right algorithm in the HASHBYTES function

Tags: hashing, sql-server, sql-server-2008-r2, t-sql

We need to create a hash value of nvarchar data for comparison purposes. There are multiple hash algorithms available in T-SQL, but which one is the best to choose in this scenario?

We want to minimize the risk of two different nvarchar values producing the same hash value. Based on my research on the internet, MD5 seems to be the best one. Is that right? MSDN (linked below) lists the available algorithms, but gives no guidance on which one to use under what conditions.

HASHBYTES (Transact-SQL)

We need to join two tables on two nvarchar(max) columns. As you can imagine, the query takes a long time to execute. We thought it would be better to store the hash value of each nvarchar(max) value and join on the hash values rather than on the nvarchar(max) values, which are blobs. The question is which hash algorithm provides enough uniqueness that we don't run the risk of one hash value mapping to more than one nvarchar(max) value.

Best Answer

The HASHBYTES function only accepts up to 8000 bytes as input. Because your inputs are potentially larger than that, only a portion of each value can be hashed, and two values that are identical within the hashed portion but differ beyond it will collide, regardless of the algorithm chosen. Carefully consider which part of the data you plan to hash -- using the first 4000 characters (8000 bytes of nvarchar) is the obvious choice, but may not be the best choice for your data. A sketch of that approach follows.
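As a minimal sketch of the prefix approach, assuming a hypothetical dbo.Documents table with an nvarchar(max) column named Body (both names are mine, not from the question):

```sql
-- Hash only the first 4000 characters: 4000 nvarchar characters
-- = 8000 bytes, the HASHBYTES input limit on SQL Server 2008 R2.
-- MD5 returns 16 bytes, so the result fits in binary(16).
ALTER TABLE dbo.Documents
    ADD BodyHash AS CAST(HASHBYTES('MD5', LEFT(Body, 4000)) AS binary(16)) PERSISTED;

-- Index the small hash column so joins on it can seek instead of scan.
CREATE INDEX IX_Documents_BodyHash ON dbo.Documents (BodyHash);
```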

In any event, because a hash function maps an arbitrarily large set of possible inputs onto a fixed number of possible outputs, collisions are unavoidable in principle, even if the inputs are 8000 bytes or less. The only way to ensure 100% correctness in the results is to compare the base values at some point (read: not necessarily first). Period.

The business will dictate whether or not 100% accuracy is required. The answer determines one of two paths: (a) comparing the base values is required, or (b) the base-value comparison can be skipped, in which case the business must decide how much accuracy to trade away for performance.

While hash collisions are possible in a unique input set, they are infinitesimally rare, regardless of the algorithm chosen. The whole idea of using a hash value in this scenario is to efficiently narrow the join down to a more manageable candidate set, not necessarily to arrive at the final result immediately. Again, for 100% accuracy, the hash comparison cannot be the final step in the process (see the sketch below). This scenario isn't using hashing for cryptographic purposes, so an algorithm such as MD5 will work fine.
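Here is what that narrowing looks like in practice, continuing the hypothetical schema from the earlier sketch (dbo.Documents plus a second table dbo.Archive with the same Body and BodyHash columns are my assumptions):

```sql
-- The hash equality is a cheap, index-supported predicate that
-- discards almost all non-matching pairs; the base-value comparison
-- then guarantees 100% accuracy on the few candidates that remain.
SELECT d.Id, a.Id
FROM dbo.Documents AS d
JOIN dbo.Archive   AS a
    ON d.BodyHash = a.BodyHash   -- narrow: 16-byte comparison
   AND d.Body     = a.Body;      -- verify: full nvarchar(max) comparison
```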

It would be extremely hard for me to justify moving up to a SHA-x algorithm for "accuracy" purposes, because if the business is going to freak out about the minuscule collision possibilities of MD5, chances are they're also going to freak out that the SHA-x algorithms aren't perfect either. They either have to come to terms with the slight inaccuracy, or mandate that the query be 100% accurate and live with the associated technical implications. I suppose if the CEO sleeps better at night knowing you used SHA-x instead of MD5, fine; it still doesn't mean much from a technical point of view in this case.
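To put "minuscule" into numbers (a back-of-the-envelope birthday bound of mine, not a claim from the original answer): for a b-bit hash over n distinct inputs, the probability of any collision is roughly

```latex
% Birthday bound for n distinct inputs and a b-bit hash (valid for n << 2^{b/2})
P(\text{collision}) \approx \frac{n^2}{2^{b+1}}

% MD5: b = 128. With n = 10^9 distinct values:
% P \approx \frac{(10^9)^2}{2^{129}} \approx 1.5 \times 10^{-21}
```

Even at a billion distinct rows, the odds of a single MD5 collision are on the order of one in 10^21.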

Speaking of performance: if the tables are read-mostly and the join result is needed frequently, consider implementing an indexed view to eliminate the need to compute the entire join every time it's requested. You trade storage for that, of course, but it may well be worth it for the performance improvement, particularly if 100% accuracy is required. A sketch follows.
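As a sketch only, continuing the hypothetical schema from above -- indexed views carry a long list of restrictions (SCHEMABINDING, deterministic expressions only, specific SET options, limits on LOB columns), so verify against the full requirements before relying on this:

```sql
-- SCHEMABINDING and two-part table names are mandatory for indexed views.
CREATE VIEW dbo.vMatchedDocuments
WITH SCHEMABINDING
AS
SELECT d.Id AS DocumentId, a.Id AS ArchiveId
FROM dbo.Documents AS d
JOIN dbo.Archive   AS a
    ON d.BodyHash = a.BodyHash
   AND d.Body     = a.Body;
GO

-- The unique clustered index materializes the join result;
-- SQL Server maintains it automatically as the base tables change.
CREATE UNIQUE CLUSTERED INDEX IX_vMatchedDocuments
    ON dbo.vMatchedDocuments (DocumentId, ArchiveId);
```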

For further reading on indexing long string values, I published an article that walks through an example of how to do this for a single table, and presents things to consider when attempting the full scenario in this question.