Sql-server – Why these characters are all equal in SQL Server

sql serverunicode

I just don't get it. See this SQL query:

select nchar(65217) -- ﻁ
select nchar(65218) -- ﻂ
select nchar(65219) -- ﻃ
select nchar(65220) -- ﻄ
if nchar(65217) = nchar(65218)
    print 'equal'
if nchar(65217) = nchar(65219)
    print 'equal'
if nchar(65217) = nchar(65220)
    print 'equal'

Based on transitive relation, it means that SQL Server considers them all to be the same character.

However, in other environments, say for example C#, they're not the same.

What I'm confused about is:

How string comparison works in SQL Server
Why comparison doesn't behave the same on one machine, and one platform, but different environments
These 4 characters represent one human-understandable character. Why they are so abundant in Unicode character map?

This of course results in tremendous problems, because I'm working on a text-processing application and data comes almost from everywhere and I need to normalize text before processing it.

If I know the reason of difference, I might find a solution to handle it.
Thank you.

Best Answer

All character data in SQL Server is associated with a collation, which determines the domain of characters that can be stored as well as the rules used to compare and sort data. Collation applies to both Unicode and Non-Unicode data.

SQL Server includes 3 broad categories of collations: binary, legacy, and Windows. Collations in the binary category (_BIN suffix) use the underlying code points to compare so equality comparisons return not-equal if the code points differ regardless of the character. Legacy (SQL_ prefix) and Windows collations provide sorting and comparison semantics for the more natural dictionary rules. This allows comparisons to consider case, accents, width, and Kana. Windows collations provide more robust word-sort rules that closely align with the Windows OS whereas legacy collations consider only single characters.

The example below illustrates the differences between Windows and binary collation with the Teth character:

CREATE TABLE dbo.WindowsColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character2 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character3 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character4 nchar(1) COLLATE Arabic_100_CI_AS_SC
    );

CREATE TABLE dbo.BinaryColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_BIN
    , Character2 nchar(1) COLLATE Arabic_100_BIN
    , Character3 nchar(1) COLLATE Arabic_100_BIN
    , Character4 nchar(1) COLLATE Arabic_100_BIN
    );

INSERT  INTO dbo.BinaryColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
INSERT  INTO dbo.WindowsColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );

--all characters compare not equal
SELECT *
FROM dbo.BinaryColationExample
WHERE
    character1 = character2
    OR character1 = character3
    OR character1 = character4
    OR character2 = character3
    OR character2 = character4
    OR character3 = character4;

--all characters compare equal
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character2;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character3 = character4;

Reasons as to why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. I summary, it may be for legacy compatibility or the characters are not canonically equivalent. Note that the Teth character ﻁ is used in different languages (http://en.wikipedia.org/wiki/Teth).

Related Solutions

Sql-server – Are these two queries technically equal to each other

No. But almost.

There are four things that joins do. I wrote about this at: JOIN simplification in SQL Server

Your "NOT IN" (which you should be careful of, regarding NULLs - try using NOT EXISTS) won't duplicate any rows. A LEFT JOIN can. But consider the use of Unique Indexes/Constraints/PKs, which can help the Query Optimizer treat these as identical.

On the whole, use NOT EXISTS rather than LEFT JOIN, because NOT EXISTS will get treated as an Anti Semi-Join, which is slightly better than what a LEFT JOIN / NOT NULL does. I show this at: Joins without JOIN

So... based on my assumption of what your queries were meant to be, the answer is:

Almost. (And how 'almost' depends on the metadata involved)

Sql-server – Returning rows where all non-NULL values are equal

The most compact syntax I can find is:

SELECT * 
FROM @T AS t
WHERE EXISTS 
(
    SELECT ISNULL(ISNULL(a, b), c) 
    INTERSECT 
    SELECT ISNULL(ISNULL(b, c), a) 
    INTERSECT 
    SELECT ISNULL(ISNULL(c, a), b)
);

Based on an idea from one of my old blog posts that describes how to use INTERSECT and EXCEPT to replace comparisons like a <> b OR (a IS NULL AND b IS NULL) with NOT EXISTS (a INTERSECT b).

Best Answer

Related Solutions

Sql-server – Are these two queries technically equal to each other

Sql-server – Returning rows where all non-NULL values are equal

Related Question