SQL Server – Choosing the right algorithm in the HASHBYTES function

Tags: hashing, sql-server, sql-server-2008-r2, t-sql

We need to create a hash value of nvarchar data for comparison purposes. There are multiple hash algorithms available in T-SQL, but which one is the best to choose in this scenario?

We want to minimize the risk of two different nvarchar values producing the same hash value. Based on my research on the internet, MD5 seems to be the best one. Is that right? MSDN (linked below) lists the available algorithms, but gives no guidance on which one to use under what conditions.

HASHBYTES (Transact-SQL)

We need to join two tables on two nvarchar(max) columns. As you can imagine, the query takes a long time to execute. We thought it would be better to store the hash value of each nvarchar(max) value and join on the hash values rather than on the nvarchar(max) values, which are blobs. The question is which hash algorithm provides enough uniqueness that we don't run the risk of one hash value mapping to more than one nvarchar(max) value.

Best Answer

The HASHBYTES function only accepts up to 8000 bytes as input. Because your inputs are potentially larger than that, only a portion of each value can be hashed, and two values that are identical within the hashed portion but differ beyond it will collide, regardless of the algorithm chosen. Carefully consider which part of the data you plan to hash -- using the first 4000 characters (8000 bytes of nvarchar) is the obvious choice, but may not be the best choice for your data. A sketch of that approach follows.
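As a minimal sketch of the prefix approach, assuming a hypothetical dbo.Documents table with an nvarchar(max) column named Body (both names are mine, not from the question):

```sql
-- Hash only the first 4000 characters: 4000 nvarchar characters
-- = 8000 bytes, the HASHBYTES input limit on SQL Server 2008 R2.
-- MD5 returns 16 bytes, so the result fits in binary(16).
ALTER TABLE dbo.Documents
    ADD BodyHash AS CAST(HASHBYTES('MD5', LEFT(Body, 4000)) AS binary(16)) PERSISTED;

-- Index the small hash column so joins on it can seek instead of scan.
CREATE INDEX IX_Documents_BodyHash ON dbo.Documents (BodyHash);
```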

In any event, because a hash function maps an arbitrarily large set of possible inputs onto a fixed number of possible outputs, collisions are unavoidable in principle, even if the inputs are 8000 bytes or less. The only way to ensure 100% correctness in the results is to compare the base values at some point (read: not necessarily first). Period.

The business will dictate whether or not 100% accuracy is required. The answer determines one of two paths: (a) comparing the base values is required, or (b) the base-value comparison can be skipped, in which case the business must decide how much accuracy to trade away for performance.

While hash collisions are possible in a unique input set, they are infinitesimally rare, regardless of the algorithm chosen. The whole idea of using a hash value in this scenario is to efficiently narrow the join down to a more manageable candidate set, not necessarily to arrive at the final result immediately. Again, for 100% accuracy, the hash comparison cannot be the final step in the process (see the sketch below). This scenario isn't using hashing for cryptographic purposes, so an algorithm such as MD5 will work fine.
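Here is what that narrowing looks like in practice, continuing the hypothetical schema from the earlier sketch (dbo.Documents plus a second table dbo.Archive with the same Body and BodyHash columns are my assumptions):

```sql
-- The hash equality is a cheap, index-supported predicate that
-- discards almost all non-matching pairs; the base-value comparison
-- then guarantees 100% accuracy on the few candidates that remain.
SELECT d.Id, a.Id
FROM dbo.Documents AS d
JOIN dbo.Archive   AS a
    ON d.BodyHash = a.BodyHash   -- narrow: 16-byte comparison
   AND d.Body     = a.Body;      -- verify: full nvarchar(max) comparison
```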

It would be extremely hard for me to justify moving up to a SHA-x algorithm for "accuracy" purposes, because if the business is going to freak out about the minuscule collision possibilities of MD5, chances are they're also going to freak out that the SHA-x algorithms aren't perfect either. They either have to come to terms with the slight inaccuracy, or mandate that the query be 100% accurate and live with the associated technical implications. I suppose if the CEO sleeps better at night knowing you used SHA-x instead of MD5, fine; it still doesn't mean much from a technical point of view in this case.
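To put "minuscule" into numbers (a back-of-the-envelope birthday bound of mine, not a claim from the original answer): for a b-bit hash over n distinct inputs, the probability of any collision is roughly

```latex
% Birthday bound for n distinct inputs and a b-bit hash (valid for n << 2^{b/2})
P(\text{collision}) \approx \frac{n^2}{2^{b+1}}

% MD5: b = 128. With n = 10^9 distinct values:
% P \approx \frac{(10^9)^2}{2^{129}} \approx 1.5 \times 10^{-21}
```

Even at a billion distinct rows, the odds of a single MD5 collision are on the order of one in 10^21.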

Speaking of performance: if the tables are read-mostly and the join result is needed frequently, consider implementing an indexed view to eliminate the need to compute the entire join every time it's requested. You trade storage for that, of course, but it may well be worth it for the performance improvement, particularly if 100% accuracy is required. A sketch follows.
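As a sketch only, continuing the hypothetical schema from above -- indexed views carry a long list of restrictions (SCHEMABINDING, deterministic expressions only, specific SET options, limits on LOB columns), so verify against the full requirements before relying on this:

```sql
-- SCHEMABINDING and two-part table names are mandatory for indexed views.
CREATE VIEW dbo.vMatchedDocuments
WITH SCHEMABINDING
AS
SELECT d.Id AS DocumentId, a.Id AS ArchiveId
FROM dbo.Documents AS d
JOIN dbo.Archive   AS a
    ON d.BodyHash = a.BodyHash
   AND d.Body     = a.Body;
GO

-- The unique clustered index materializes the join result;
-- SQL Server maintains it automatically as the base tables change.
CREATE UNIQUE CLUSTERED INDEX IX_vMatchedDocuments
    ON dbo.vMatchedDocuments (DocumentId, ArchiveId);
```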

For further reading on indexing long string values, I published an article that walks through an example of how to do this for a single table, and presents things to consider when attempting the full scenario in this question.