I just don't get it. See this SQL query:
select nchar(65217) -- ﻁ
select nchar(65218) -- ﻂ
select nchar(65219) -- ﻃ
select nchar(65220) -- ﻄ
if nchar(65217) = nchar(65218)
print 'equal'
if nchar(65217) = nchar(65219)
print 'equal'
if nchar(65217) = nchar(65220)
print 'equal'
By transitivity, this means that SQL Server considers all four to be the same character.
However, in other environments, say for example C#, they're not the same.
What I'm confused about is:
- How string comparison works in SQL Server
- Why comparison behaves differently in different environments, even on the same machine and platform
- These 4 characters represent one human-understandable character. Why are there so many of them in the Unicode character map?
This of course causes tremendous problems, because I'm working on a text-processing application, data comes from almost everywhere, and I need to normalize text before processing it.
If I knew the reason for the difference, I might find a way to handle it.
Thank you.
Best Answer
All character data in SQL Server is associated with a collation, which determines the domain of characters that can be stored as well as the rules used to compare and sort data. Collation applies to both Unicode and non-Unicode data.
SQL Server includes 3 broad categories of collations: binary, legacy, and Windows. Collations in the binary category (_BIN suffix) compare the underlying code points, so equality comparisons return not-equal whenever the code points differ, regardless of the character. Legacy (SQL_ prefix) and Windows collations provide sorting and comparison semantics that follow more natural dictionary rules. This allows comparisons to consider case, accents, width, and Kana. Windows collations provide more robust word-sort rules that closely align with the Windows OS, whereas legacy collations consider only single characters.
The Teth character shows the difference directly: a binary collation treats code points 65217 through 65220 as four distinct values, whereas the results in the question show them comparing equal under the dictionary rules of the collation in effect there.
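A minimal sketch of that comparison, assuming Latin1_General_CI_AS and Latin1_General_BIN as stand-ins for the Windows and binary categories (the question does not say which collation was actually in effect, so the first result may vary with the collation you pick):

```sql
-- Windows collation (assumed: Latin1_General_CI_AS): dictionary comparison rules.
-- The question reports these forms comparing equal under its default collation.
IF NCHAR(65217) = NCHAR(65218) COLLATE Latin1_General_CI_AS
    PRINT 'Windows collation: equal';
ELSE
    PRINT 'Windows collation: not equal';

-- Binary collation (assumed: Latin1_General_BIN): compares code points,
-- so 65217 and 65218 are two different values.
IF NCHAR(65217) = NCHAR(65218) COLLATE Latin1_General_BIN
    PRINT 'Binary collation: equal';
ELSE
    PRINT 'Binary collation: not equal';
```

Putting the COLLATE clause on one operand is enough here, because an explicit collation takes precedence over the default collation of the other operand when SQL Server resolves the collation of the comparison.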
Reasons why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. In summary, it may be for legacy compatibility, or the characters may not be canonically equivalent. Note that the Teth character ﻁ is used in different languages (http://en.wikipedia.org/wiki/Teth).