Performance Comparison: Grouping/Sorting with BIGINT vs NVARCHAR in SQL Server

sql servert-sql

I want to store a hash-code for a variable-length text field (max 1000 chars) in a database table. The hash-code will be computed and assigned once on insert, and new rows will be inserted very often.

The hash-code will be used mainly for filtering (WHERE), grouping (GROUP BY), and sorting (ORDER BY) in a couple of queries. The database table will hold a few million rows over time, with the probability of identical hash-codes (for identical text) being around 30% (rest being unique).

I have the choice of making the hash-code data type NVARCHAR (SHA1 of text) or BIGINT (converted bytes of SHA1 of text). I think BIGINT will be better in terms of storage space (less pages).

Generally speaking, which of these two data types will be better in terms of performance, considering the operations mentioned above?

Best Answer

You want to store the hash using the densest datatype possible.

A bigint is 8 bytes, or 64-bits of data. An nvarchar storing 8 bytes of data would take 10 bytes on disk.

But how long are your hashes? SHA-1 hashes are 20 bytes long (160 bits) so won't fit in a bigint. Storing it as an nvarchar will take 42 bytes of storage if stored naively.

According to https://stackoverflow.com/a/16680423/391445 you best option is to use binary(20) so you are storing exactly 20 bytes (160 bits) of data.

Are you intending to do the hashing in the database or in client code? According to the documentation, as of SQL Server 2016 all hash algorithms other than SHA2_256 and SHA2_512 are deprecated.

But a cryptographic hash function is probably not the best choice of hash function for you if all you want is a random distribution of rows. There are many hash functions available, producing whatever length key you desire!

Related Solutions

SQL Server – Differences Between varchar and nvarchar

You need to be sure that you prefix Unicode string literals with an N prefix. For example these will work differently if the underlying data type is NVARCHAR:

CREATE TABLE dbo.t(c NVARCHAR(32));

INSERT dbo.t(c) SELECT 'រៀន';
INSERT dbo.t(c) SELECT 'នរៀ';
INSERT dbo.t(c) SELECT N'រៀន';

SELECT c FROM dbo.t;

SELECT c FROM dbo.t WHERE c = 'រៀន';
SELECT c FROM dbo.t WHERE c = N'រៀន';

Results:

c
----
??? -- not stored correctly
??? -- not stored correctly
រៀន -- stored correctly!

c
----
???
??? -- probably not expected, however all Unicode characters have been changed to ?

c
----
រៀន

For those on mobile devices or decrepit browsers that show box characters instead of actual Unicode characters, this is what it looks like:

enter image description here

T-sql – Comparing INT and NVARCHAR

I solved it myself. In case someone runs into a similar problem: [dbo].[MemoCount] also has an ID Column. so i changed

AND ([Reference] COLLATE SQL_Latin1_General_CP1_CI_AS = [ID]))

to:

AND ([Reference]  = [Foobar].[dbo].[Expense].[ID])) AS [MemoCount]

Apparently SQL took the column [dbo].[MemoUnion].[ID] to compare it with [Reference]. That problem was of course not solvable for you guys because I never posted the Columns of [dbo].[MemoUnion] in the first place. Thanks for all the suggestions nonetheless!

Best Answer

Related Solutions

SQL Server – Differences Between varchar and nvarchar

T-sql – Comparing INT and NVARCHAR

Related Question