SQL Server – Byte ordering for multibyte characters in SQL Server versus Oracle

encoding, hashing, oracle, sql-server, unicode

I am currently in the process of migrating data from Oracle to SQL Server and I'm encountering an issue trying to validate the data post-migration.

Environment Details:

Oracle 12

  • AL32UTF8 character set
  • Client NLS_LANG – WE8MSWIN1252
  • VARCHAR2 field

SQL Server 2016

  • Latin1_General_CI_AS collation
  • NVARCHAR field

I'm using DBMS_CRYPTO.HASH in Oracle to generate a checksum of each whole row, copying the data to SQL Server, generating a checksum of the same row there with HASHBYTES, and then comparing the two checksums to validate that the data matches.
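Roughly, the comparison looks like this on each side (simplified to a hypothetical two-column table, hashing with SHA-256; the real statements concatenate the full row):

-- Oracle (requires EXECUTE on DBMS_CRYPTO): hash the concatenated row
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW(c1 || '|' || c2),
                        DBMS_CRYPTO.HASH_SH256) AS row_hash
FROM   source_table;

-- SQL Server: the equivalent check after migration
SELECT HASHBYTES('SHA2_256', c1 + N'|' + c2) AS row_hash
FROM   target_table;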

The checksums match for all rows, except those with multibyte characters.

For example, rows containing this character: ◦ do not produce matching checksums, even though the data is transferred correctly. When I use DUMP in Oracle or convert to VARBINARY in SQL Server, the data matches exactly except for the bytes of this character.

In SQL Server, the bytes are 0xE625 and in Oracle they are 0x25E6.
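For reference, the SQL Server bytes can be reproduced with a simple VARBINARY conversion of the character:

SELECT CONVERT(VARBINARY(10), N'◦') AS nvarchar_bytes;  -- U+25E6; returns 0xE625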

Why are they ordered differently, and is there a reliable way to convert one to the other to ensure the checksum at the other end matches for strings with multi-byte characters?

Best Answer

The collation of an NVARCHAR / NCHAR / NTEXT column has no bearing on the encoding used to store the data in that column. NVARCHAR data is always UTF-16 Little Endian (LE). The collation of NVARCHAR data only impacts sorting and comparison. Collation does impact the encoding of VARCHAR data since the collation determines the code page used to store the data in that column / variable / literal, but we aren't dealing with that here.
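For instance, here is a quick illustration you can run in SQL Server (the second collation is arbitrary; the point is that the stored bytes do not change):

SELECT CONVERT(VARBINARY(4), N'◦' COLLATE Latin1_General_CI_AS) AS latin1_bytes,
       CONVERT(VARBINARY(4), N'◦' COLLATE Japanese_CI_AS)       AS japanese_bytes;
-- both columns return 0xE625 (UTF-16 Little Endian), regardless of collation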

As sepupic mentioned, what you are seeing when you view the data in binary form is a difference in endianness (Oracle is using Big Endian while SQL Server is using Little Endian). HOWEVER, what you are seeing when you view the binary form of the string in Oracle is not how the data is actually being stored. You are using AL32UTF8, which is UTF-8, and UTF-8 encodes that character in 3 bytes, not 2: E2, 97, A6.

Also, it is not possible for the hashes to match for rows containing just "a" but not for rows that include "◦", not unless the hashing in Oracle was done without conversion (hence over the UTF-8 encoding) and the hashing in SQL Server accidentally converted the value to VARCHAR first. Otherwise there is no hash algorithm that behaves as you are describing, as you can verify by running the following in SQL Server:

-- 0x3100 and 0x0031 are the little-endian and big-endian byte forms of N'1' (U+0031);
-- every algorithm below returns two different hashes for the two byte orders.
DECLARE @Algorithm NVARCHAR(50) = N'MD4';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'MD5';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA1';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_256';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_512';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);

In Oracle, you should use the CONVERT function to get the string into the AL16UTF16LE encoding, and then hash that value. That should match what SQL Server has. For example, you can see the different encoding forms of White Bullet (U+25E6), and how using CONVERT along with AL16UTF16LE corrects this, on dbfiddle and below:

SELECT DUMP(CHR(14849958), 1016) AS "UTF8",
       DUMP(CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT(CHR(9702 USING NCHAR_CS), 'AL16UTF16LE' ), 1016) AS "UTF16LE"
FROM DUAL;

SELECT DUMP('a' || CHR(14849958), 1016) AS "UTF8",
       DUMP('a' || CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT('a' || CHR(9702 USING NCHAR_CS), 'AL16UTF16LE' ), 1016) AS "UTF16LE"
FROM DUAL;

That returns:

UTF8:     Typ=1 Len=3 CharacterSet=AL32UTF8: e2,97,a6
UTF16BE:  Typ=1 Len=2 CharacterSet=AL16UTF16: 25,e6
UTF16LE:  Typ=1 Len=2 CharacterSet=AL16UTF16: e6,25


UTF8:     Typ=1 Len=4 CharacterSet=AL32UTF8: 61,e2,97,a6
UTF16BE:  Typ=1 Len=4 CharacterSet=AL16UTF16: 0,61,25,e6
UTF16LE:  Typ=1 Len=4 CharacterSet=AL16UTF16: 61,0,e6,25

As you can see in the 3rd column, the character set is misreported as the Big Endian AL16UTF16 even though the data is clearly Little Endian based on the order of the two bytes. You can also see that both characters ('a' and '◦') are two bytes each in UTF-16, and that the byte order of both is swapped between Big and Little Endian, not just that of the characters that take more than 1 byte in UTF-8.
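For comparison, the same two values stored as NVARCHAR in SQL Server show the matching little-endian bytes:

SELECT CONVERT(VARBINARY(8), NCHAR(9702))        AS white_bullet, -- 0xE625
       CONVERT(VARBINARY(8), N'a' + NCHAR(9702)) AS with_a;       -- 0x6100E625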

Given all of this, since the data is being stored as UTF-8 yet you are seeing it as UTF-16 Big Endian via the DUMP function, it seems like you are already converting it to UTF-16, but probably not realizing that the default UTF-16 in Oracle is Big Endian.

Looking at the "UTF-16" definition on the Glossary page of the Oracle documentation, it states (I broke the following sentences into two parts so it would be easier to distinguish between BE and LE):

AL16UTF16 implements the big-endian encoding scheme of the UTF-16 encoding form (more significant byte of each code unit comes first in memory). AL16UTF16 is a valid national character set.

and:

AL16UTF16LE implements the little-endian UTF-16 encoding scheme. It is a conversion-only character set, valid only in character set conversion functions such as SQL CONVERT or PL/SQL UTL_I18N.STRING_TO_RAW.
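Putting that together, a minimal sketch of the Oracle-side hash (assuming you have EXECUTE on DBMS_CRYPTO and are hashing with SHA-256) could use UTL_I18N.STRING_TO_RAW to do the conversion to AL16UTF16LE and produce the RAW input in one step:

-- Convert to UTF-16 LE bytes, then hash those bytes
SELECT DBMS_CRYPTO.HASH(
         UTL_I18N.STRING_TO_RAW('a' || CHR(9702 USING NCHAR_CS), 'AL16UTF16LE'),
         DBMS_CRYPTO.HASH_SH256) AS utf16le_hash
FROM   DUAL;

That should return the same value as the following in SQL Server, since both hash the byte sequence 0x6100E625:

SELECT HASHBYTES('SHA2_256', N'a' + NCHAR(9702)) AS utf16le_hash;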

P.S. Since you are using AL32UTF8 in Oracle, you should be using the Latin1_General_100_CI_AS_SC collation in SQL Server, not Latin1_General_CI_AS. The one you are using is older and doesn't fully support Supplementary Characters (no data loss if they exist, but built-in functions handle them as 2 characters instead of a single entity).
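For example, here is a quick illustration of the difference using U+1F600 built from its surrogate pair (this affects only character counting, not storage):

DECLARE @emoji NVARCHAR(10) = NCHAR(55357) + NCHAR(56832); -- U+1F600 as surrogate pair D83D + DE00
SELECT LEN(@emoji COLLATE Latin1_General_CI_AS)        AS non_SC_len, -- 2
       LEN(@emoji COLLATE Latin1_General_100_CI_AS_SC) AS SC_len;     -- 1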