I am currently in the process of migrating data from Oracle to SQL Server and I'm encountering an issue trying to validate the data post-migration.
Environment Details:
- Oracle 12 – AL32UTF8 character set
  - Client NLS_LANG – WE8MSWIN1252
  - VARCHAR2 field
- SQL Server 2016
  - Latin1_General_CI_AS collation
  - NVARCHAR field
I'm using DBMS_CRYPTO.HASH on Oracle to generate a checksum of the whole row, then copying the data to SQL Server and using HASHBYTES to generate a checksum of the whole row, and comparing the two to validate that the data matches.
The checksums match for all rows, except those with multibyte characters.
For example, rows with this character: ◦ do not match in the checksums, even though the data is transferred correctly. When I use DUMP in Oracle or convert to VARBINARY in SQL Server the data matches exactly except for the bytes for this character.
In SQL Server, the bytes are 0xE625 and in Oracle they are 0x25E6.
Why are they ordered differently, and is there a reliable way to convert one to the other to ensure the checksum at the other end matches for strings with multi-byte characters?
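For context, the two hashing calls involved look roughly like the sketch below. This is only a sketch: the table and column names (MigTest, Col1) and the choice of SHA-256 are placeholders, not details taken from the question.

```sql
-- Oracle side (sketch): hash the raw bytes of the VARCHAR2 value.
-- 4 is the literal value of DBMS_CRYPTO.HASH_SH256, since package
-- constants cannot be referenced directly in plain SQL.
SELECT DBMS_CRYPTO.HASH(UTL_RAW.CAST_TO_RAW(Col1), 4) AS row_checksum
FROM   MigTest;

-- SQL Server side (sketch): hash the migrated NVARCHAR value.
SELECT HASHBYTES('SHA2_256', Col1) AS row_checksum
FROM   dbo.MigTest;
```

Which byte encoding each side actually feeds into its hash function is the crux of the answer below.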
Best Answer
The collation of an NVARCHAR/NCHAR/NTEXT column has no bearing on the encoding used to store the data in that column. NVARCHAR data is always UTF-16 Little Endian (LE). The collation of NVARCHAR data only impacts sorting and comparison. Collation does impact the encoding of VARCHAR data, since the collation determines the code page used to store the data in that column / variable / literal, but we aren't dealing with that here.

As sepupic mentioned, what you are seeing when you view the data in binary form is a difference in endianness (Oracle is using Big Endian while SQL Server is using Little Endian). HOWEVER, what you are seeing when you view the binary form of the string in Oracle is not how the data is actually being stored. You are using AL32UTF8, which is UTF-8 and encodes that character in 3 bytes, not 2: E2, 97, A6.
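On the SQL Server side, a quick way to confirm the first point above (that collation does not change the stored bytes of NVARCHAR data) is to look at the binary form of the character under two different collations; this illustration is mine, not part of the original answer:

```sql
-- The stored bytes of N'◦' (U+25E6, White Bullet) are UTF-16 LE
-- regardless of which collation is applied.
SELECT CONVERT(VARBINARY(10), N'◦' COLLATE Latin1_General_CI_AS)        AS [CI_AS],
       CONVERT(VARBINARY(10), N'◦' COLLATE Latin1_General_100_CI_AS_SC) AS [100_CI_AS_SC];
-- Both columns return 0xE625.
```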
Also, it is not possible for the hashes to match for rows containing only "a" but not for rows that include "◦", unless the hashing in Oracle was done without conversion (hence over the UTF-8 encoding) and the hashing in SQL Server accidentally converted the value to VARCHAR first. Otherwise, no hash algorithm will behave as you are describing, as you can verify by running a test like the following in SQL Server:
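This is a sketch of such a test, with my own strings and SHA-256 as the algorithm (it is not the answer's original script):

```sql
-- The NVARCHAR bytes of 'a' (0x6100) differ from the single UTF-8/ASCII
-- byte (0x61) that Oracle would hash, so a plain NVARCHAR hash could not
-- have matched Oracle's UTF-8 hash even for ASCII-only rows.
SELECT
    HASHBYTES('SHA2_256', 'a')                  AS [VARCHAR_a],         -- hashes 0x61
    HASHBYTES('SHA2_256', N'a')                 AS [NVARCHAR_a],        -- hashes 0x6100
    HASHBYTES('SHA2_256', N'a' + NCHAR(0x25E6)) AS [NVARCHAR_aBullet],
    HASHBYTES('SHA2_256',
              CONVERT(VARCHAR(10), N'a' + NCHAR(0x25E6)))
                                                AS [VARCHAR_aBullet];   -- ◦ is lost as '?'
-- [VARCHAR_a] and [NVARCHAR_a] differ, while [VARCHAR_a] is computed over the
-- same byte (0x61) that Oracle hashes for the UTF-8 form of 'a'.
```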
In Oracle, you should use the CONVERT function to get the string into the AL16UTF16LE encoding, and then hash that value. That should match what SQL Server has. For example, you can see the different encoding forms of White Bullet (U+25E6), and how using CONVERT along with AL16UTF16LE corrects this (the original answer demonstrated this on dbfiddle):
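A query along these lines shows the three encoding forms; the two-character test string and the column aliases are mine, and DUMP's 1016 format code prints hex bytes plus the character set name:

```sql
-- 'a' followed by White Bullet (U+25E6), built in the database
-- character set (AL32UTF8) via TO_CHAR(UNISTR(...)).
SELECT DUMP(TO_CHAR(UNISTR('\0061\25E6')), 1016)                         AS "UTF-8",
       DUMP(CONVERT(TO_CHAR(UNISTR('\0061\25E6')), 'AL16UTF16'),   1016) AS "UTF-16 BE",
       DUMP(CONVERT(TO_CHAR(UNISTR('\0061\25E6')), 'AL16UTF16LE'), 1016) AS "UTF-16 LE"
FROM   DUAL;

-- Returns approximately:
--   "UTF-8":     Typ=1 Len=4 CharacterSet=AL32UTF8:  61,e2,97,a6
--   "UTF-16 BE": Typ=1 Len=4 CharacterSet=AL16UTF16: 0,61,25,e6
--   "UTF-16 LE": Typ=1 Len=4 CharacterSet=AL16UTF16: 61,0,e6,25
```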
As you can see in the 3rd column, the character set is misreported as being Big Endian when it is clearly Little Endian based on the order of the two bytes. You can also see that both characters are two bytes in UTF-16, and the order of both of them is different between Big and Little Endian, not just the characters that are > 1 byte in UTF-8.
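Applied to the actual row checksum, the Oracle side of the comparison would then look something like the following sketch (the table and column names are placeholders, and 4 is the literal value of DBMS_CRYPTO.HASH_SH256):

```sql
-- Hash the UTF-16 LE bytes so the result lines up with SQL Server's
-- HASHBYTES over the NVARCHAR value.
SELECT DBMS_CRYPTO.HASH(
           UTL_RAW.CAST_TO_RAW(CONVERT(Col1, 'AL16UTF16LE')),
           4  -- DBMS_CRYPTO.HASH_SH256
       ) AS row_checksum
FROM   MigTest;
```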
Given all of this, since the data is being stored as UTF-8 yet you are seeing it as UTF-16 Big Endian via the DUMP function, it seems like you are already converting it to UTF-16, but probably not realizing that the default UTF-16 in Oracle is Big Endian. The "UTF-16" definition on the Glossary page of the Oracle documentation notes that the big-endian encoding form of UTF-16 is what Oracle calls AL16UTF16, while the little-endian form is AL16UTF16LE, which can only be used for character set conversion and not as a database or national character set.
P.S. Since you are using AL32UTF8 in Oracle, you should be using the Latin1_General_100_CI_AS_SC collation in SQL Server, not Latin1_General_CI_AS. The one you are using is older and doesn't fully support Supplementary Characters (there is no data loss if they exist, but built-in functions handle them as 2 characters instead of a single entity).
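To see the Supplementary Character difference that the P.S. refers to, a small test like this (my own example) shows how the built-in functions count a character outside the BMP under each collation:

```sql
-- U+1F600 is a supplementary character, stored as the surrogate pair D83D DE00.
DECLARE @Emoji NVARCHAR(10) = NCHAR(0xD83D) + NCHAR(0xDE00);

SELECT LEN(@Emoji COLLATE Latin1_General_CI_AS)        AS [Len_NonSC], -- 2 (two surrogate code units)
       LEN(@Emoji COLLATE Latin1_General_100_CI_AS_SC) AS [Len_SC];    -- 1 (one character)
```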