Sql-server – Why these characters are all equal in SQL Server

sql serverunicode

I just don't get it. See this SQL query:

select nchar(65217) -- ﻁ
select nchar(65218) -- ﻂ
select nchar(65219) -- ﻃ
select nchar(65220) -- ﻄ
if nchar(65217) = nchar(65218)
    print 'equal'
if nchar(65217) = nchar(65219)
    print 'equal'
if nchar(65217) = nchar(65220)
    print 'equal'

Based on transitive relation, it means that SQL Server considers them all to be the same character.

However, in other environments, say for example C#, they're not the same.

What I'm confused about is:

  1. How string comparison works in SQL Server
  2. Why comparison doesn't behave the same on one machine, and one platform, but different environments
  3. These 4 characters represent one human-understandable character. Why they are so abundant in Unicode character map?

This of course results in tremendous problems, because I'm working on a text-processing application and data comes almost from everywhere and I need to normalize text before processing it.

If I know the reason of difference, I might find a solution to handle it.
Thank you.

Best Answer

All character data in SQL Server is associated with a collation, which determines the domain of characters that can be stored as well as the rules used to compare and sort data. Collation applies to both Unicode and Non-Unicode data.

SQL Server includes 3 broad categories of collations: binary, legacy, and Windows. Collations in the binary category (_BIN suffix) use the underlying code points to compare so equality comparisons return not-equal if the code points differ regardless of the character. Legacy (SQL_ prefix) and Windows collations provide sorting and comparison semantics for the more natural dictionary rules. This allows comparisons to consider case, accents, width, and Kana. Windows collations provide more robust word-sort rules that closely align with the Windows OS whereas legacy collations consider only single characters.

The example below illustrates the differences between Windows and binary collation with the Teth character:

CREATE TABLE dbo.WindowsColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character2 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character3 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character4 nchar(1) COLLATE Arabic_100_CI_AS_SC
    );

CREATE TABLE dbo.BinaryColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_BIN
    , Character2 nchar(1) COLLATE Arabic_100_BIN
    , Character3 nchar(1) COLLATE Arabic_100_BIN
    , Character4 nchar(1) COLLATE Arabic_100_BIN
    );

INSERT  INTO dbo.BinaryColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
INSERT  INTO dbo.WindowsColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );

--all characters compare not equal
SELECT *
FROM dbo.BinaryColationExample
WHERE
    character1 = character2
    OR character1 = character3
    OR character1 = character4
    OR character2 = character3
    OR character2 = character4
    OR character3 = character4;

--all characters compare equal
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character2;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character3 = character4;

Reasons as to why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. I summary, it may be for legacy compatibility or the characters are not canonically equivalent. Note that the Teth character is used in different languages (http://en.wikipedia.org/wiki/Teth).