Performance Comparison: Grouping/Sorting with BIGINT vs NVARCHAR in SQL Server

sql servert-sql

I want to store a hash-code for a variable-length text field (max 1000 chars) in a database table. The hash-code will be computed and assigned once on insert, and new rows will be inserted very often.

The hash-code will be used mainly for filtering (WHERE), grouping (GROUP BY), and sorting (ORDER BY) in a couple of queries. The database table will hold a few million rows over time, with the probability of identical hash-codes (for identical text) being around 30% (rest being unique).

I have the choice of making the hash-code data type NVARCHAR (SHA1 of text) or BIGINT (converted bytes of SHA1 of text). I think BIGINT will be better in terms of storage space (less pages).

Generally speaking, which of these two data types will be better in terms of performance, considering the operations mentioned above?

Best Answer

You want to store the hash using the densest datatype possible.

A bigint is 8 bytes, or 64-bits of data. An nvarchar storing 8 bytes of data would take 10 bytes on disk.

But how long are your hashes? SHA-1 hashes are 20 bytes long (160 bits) so won't fit in a bigint. Storing it as an nvarchar will take 42 bytes of storage if stored naively.

According to https://stackoverflow.com/a/16680423/391445 you best option is to use binary(20) so you are storing exactly 20 bytes (160 bits) of data.

Are you intending to do the hashing in the database or in client code? According to the documentation, as of SQL Server 2016 all hash algorithms other than SHA2_256 and SHA2_512 are deprecated.

But a cryptographic hash function is probably not the best choice of hash function for you if all you want is a random distribution of rows. There are many hash functions available, producing whatever length key you desire!