Sql-server – Why isn’t Unicode character REPLACEd in some cases

collationsql servert-sqlunicode

Running this results in "test", as expected:

SELECT
    REPLACE(NCHAR(1234), NCHAR(1234), N'test');

However, running this results in "aӒa", which doesn't have "test" in it:

SELECT
    REPLACE(N'a' + NCHAR(1234) + N'a', NCHAR(1234), N'test');

I thought it might have to do with the string concatenation happening for the haystack but not the needle, but when I tried this, it still didn't "work":

SELECT
    REPLACE(N'a' + NCHAR(1234) + N'a', N'' + NCHAR(1234) + N'', N'test');

Result: "aӒa"

I suspected it might have to do with how it's interpreting the characters, so I tried specifying a binary collation… and that "fixed" the issue:

SELECT
    REPLACE(N'a' + NCHAR(1234) + N'a' COLLATE Latin1_General_100_BIN2, NCHAR(1234), N'test');

Result: "atesta".

Why?

This behavior seems to exist for some characters, but not others.

SELECT
    REPLACE(N'a' + NCHAR(23423) + N'a', NCHAR(23423), N'test');

Result: "atesta" ("works")

SELECT
    REPLACE(N'a' + NCHAR(5342) + N'a', NCHAR(5342), N'test');

Result: "aᓞa" (doesn't "work")

Why?

Best Answer

This behavior is simply due to the fact that you are executing the query in a Database that has a pre-"100" series default Collation, in which case a great many characters do not have any sort weight. No sort weight means that they equate to empty string. They have a value of 0. So they always equal each other when nothing else having a sort weight is involved. They also equate to empty string. The version 100 Collations (starting in SQL Server 2008 which is version 10.0 or 100 without the decimal / minor version) added sort weights for most characters. So, just force the collation by adding COLLATE Latin1_General_100_CI_AS_SC. A binary Collation of any version (one ending in _BIN2, don't use Collations ending in just _BIN) also works as there are no such things as sort weights in binary Collations.

SELECT 1 WHERE NCHAR(1234) = '' COLLATE SQL_Latin1_General_CP1_CI_AS;
-- 1

SELECT 2 WHERE NCHAR(1234) = '' COLLATE Latin1_General_CI_AS;
-- 2

SELECT 3 WHERE NCHAR(1234) = '' COLLATE Latin1_General_100_CI_AS;
-- (no results)

SELECT 4 WHERE NCHAR(1234) = '' COLLATE Latin1_General_BIN2;
-- (no results)

Even multiple characters with no weight still equate to empty string (or any number of characters with no sort weight):

SELECT 5 WHERE NCHAR(1234) + NCHAR(1234) + NCHAR(1234)
       = N'' COLLATE Latin1_General_CS_AS_KS_WS;
-- 5

In this next test, NCHAR(1234) equates to itself only when using a version 100 (or newer) Collation. Pre-version 100 Collations assign no sort weight to NCHAR(1234) such that it doesn't equate to anything, hence it was previously not found in the first expression.

SELECT REPLACE(N'a' + NCHAR(1234) + N'a',
               NCHAR(1234) COLLATE Latin1_General_100_CI_AS,
               N'test');
-- atesta

Continuing example # 5 above, we can prove that 3 characters having no weight equate to 2 different characters that also have no weight:

SELECT 6 WHERE NCHAR(1234) + NCHAR(1234) + NCHAR(1234)
             = NCHAR(5342) + NCHAR(5342) COLLATE Latin1_General_CS_AS_KS_WS;
-- 6

But switching to a version 100 Collation changes that:

SELECT 7 WHERE NCHAR(1234) + NCHAR(1234) + NCHAR(1234)
             = NCHAR(5342) + NCHAR(5342) COLLATE Latin1_General_100_CS_AS_KS_WS;
-- (no results)

SELECT 8 WHERE NCHAR(1234) = NCHAR(5342) COLLATE Latin1_General_100_CS_AS_KS_WS;
-- (no results)

Please see the "Supplementary Characters" characters section of the following post of mine. In that section I show similar behavior as it relates to supplementary characters, and at the end of the section I show how many BMP characters (i.e. non-supplementary characters) match one of these "missing weights" characters per Collation version.

The Uni-Code: The Search for the True List of Valid Characters for T-SQL Identifiers, Part 3 of 2 (Delimited Identifiers)

For more info on working with Collations, please visit: Collations.info