SQL Server – Why does searching for LIKE N'%�%' match any Unicode character and = N'�' match many

Tags: like, sql-server, sql-server-2016, unicode

DECLARE @T TABLE(
  Col NCHAR(1));

INSERT INTO @T
VALUES      (N'A'),
            (N'B'),
            (N'C'),
            (N'Ƕ'),
            (N'Ƿ'),
            (N'Ǹ');

SELECT *
FROM   @T
WHERE  Col LIKE N'%�%'

Returns

Col
A
B
C
Ƕ
Ƿ
Ǹ

SELECT *
FROM   @T
WHERE  Col = N'�' 

Returns

Col
Ƕ
Ƿ
Ǹ

Generating every possible double-byte "character" with the query below shows that the = version matches 21,229 of them and the LIKE N'%�%' version matches all of them (I've tried a few non-binary collations with the same result).

WITH T(I, N)
AS 
(
SELECT TOP 65536 ROW_NUMBER() OVER (ORDER BY @@SPID),
                 NCHAR(ROW_NUMBER() OVER (ORDER BY @@SPID))
FROM master..spt_values v1, 
     master..spt_values v2
)
SELECT I, N 
FROM T
WHERE N = N'�'  

Is anyone able to shed any light on what is going on here?

Using COLLATE Latin1_General_BIN matches only the single character NCHAR(65533) – but the question is to understand what rules are used in the other cases. What's special about those 21,229 characters that match the =, and why does everything match the wildcard LIKE? I presume there is some reason behind it that I am missing.
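For reference, a minimal sketch of that binary-collation check (assuming the @T table variable from the question is still in scope, i.e. the same batch):

SELECT *
FROM   @T
WHERE  Col = N'�' COLLATE Latin1_General_BIN;
-- A binary collation compares raw code points, so this should match only
-- rows containing NCHAR(65533) itself: here, no rows at all.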

nchar(65534) [and 21k others] work just as well as nchar(65533). The question could equally have been phrased using nchar(502) – it behaves the same way, both as LIKE N'%Ƕ%' (matching everything) and in the = case. That's probably quite a big clue.

Changing the SELECT in the last query to SELECT I, N, RANK() OVER(ORDER BY N) shows that SQL Server can't rank these characters: every row comes back with the same rank. It seems that any character not handled by the collation is considered equivalent.
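Spelled out, that modified query is:

WITH T(I, N)
AS 
(
SELECT TOP 65536 ROW_NUMBER() OVER (ORDER BY @@SPID),
                 NCHAR(ROW_NUMBER() OVER (ORDER BY @@SPID))
FROM master..spt_values v1, 
     master..spt_values v2
)
SELECT I, N, RANK() OVER (ORDER BY N) AS Rnk -- all rows are expected to share one rank
FROM T
WHERE N = N'�'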

A database with a Latin1_General_100_CS_AS collation produces 5,840 matches. Latin1_General_100_CS_AS cuts down the = matches quite considerably, but doesn't change the LIKE behaviour. It does seem, then, that there is a pot of characters – one that has shrunk in later collations – that all compare equal and are ignored in wildcard LIKE searches.
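Those counts can also be reproduced without creating a database per collation by forcing the collation in the predicate instead (a sketch; an explicit COLLATE clause on the comparison should behave the same as the database default):

WITH T(I, N)
AS 
(
SELECT TOP 65536 ROW_NUMBER() OVER (ORDER BY @@SPID),
                 NCHAR(ROW_NUMBER() OVER (ORDER BY @@SPID))
FROM master..spt_values v1, 
     master..spt_values v2
)
SELECT COUNT(*) AS EqualMatches
FROM T
WHERE N = N'�' COLLATE Latin1_General_100_CS_AS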

I am using SQL Server 2016. The symbol is the Unicode replacement character, but the only invalid characters in the UCS-2 encoding are 55296 – 57343 (the surrogate range, U+D800 – U+DFFF) AFAIK, and it is clearly matching perfectly valid code points such as N'Ԛ' that are not in this range.

All these characters behave like the empty string for LIKE and =. They even evaluate as equivalent: N'' = N'�' is true, and you can drop one into a LIKE comparison of single spaces – LIKE '_' + NCHAR(65533) + '_' – with no effect. LEN comparisons yield different results though, so it is probably only certain string functions that ignore them.
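For example, all of the following behave as described (a sketch; run under one of the non-binary collations discussed here):

-- Equality with the empty string: returns 'equal'
SELECT CASE WHEN N'' = NCHAR(65533) THEN 'equal' ELSE 'not equal' END;

-- Two spaces match a three-character pattern containing U+FFFD,
-- because the replacement character contributes nothing: returns 'match'
SELECT CASE WHEN N'  ' LIKE '_' + NCHAR(65533) + '_' THEN 'match' ELSE 'no match' END;

-- LEN still sees the character, though: returns 0 and 1
SELECT LEN(N'') AS EmptyLen, LEN(NCHAR(65533)) AS ReplacementLen;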

I think the LIKE behaviour is correct for this case; it behaves like an unknown value (which could be anything). The same happens for these other characters:

  • nchar(11217) (Uncertainty Sign)
  • nchar(65532) (Object Replacement Character)
  • nchar(65533) (Replacement Character)
  • nchar(65534) (Not a Character)

So if I wanted to find all the characters that represent uncertainty with the equals sign, I would use a collation that supports supplementary characters, such as Latin1_General_100_CI_AS_SC.
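A quick sanity check that the four characters listed above do all compare equal to one another (a sketch; each column is expected to return 1 under the non-SC collations discussed here):

SELECT CASE WHEN NCHAR(11217) = NCHAR(65533) THEN 1 ELSE 0 END AS [2BD1 = FFFD],
       CASE WHEN NCHAR(65532) = NCHAR(65533) THEN 1 ELSE 0 END AS [FFFC = FFFD],
       CASE WHEN NCHAR(65534) = NCHAR(65533) THEN 1 ELSE 0 END AS [FFFE = FFFD];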

I guess these are the group of "non weighted characters" mentioned in the documentation, Collation and Unicode Support.

Best Answer

How one "character" (which can be comprised of multiple Code Points: surrogate pairs, combining characters, etc) compares to another is based on a rather complex set of rules. It is so complex due to needing to account for all the various (and sometimes "wacky") rules found in all of the languages represented in the Unicode specification. This system applies to non-binary Collations for all NVARCHAR data, and for VARCHAR data that is using a Windows Collation and not a SQL Server Collation (one starting with SQL_). This system does not apply to VARCHAR data using a SQL Server Collation as those use simple mappings.

Most of the rules are defined in the Unicode Collation Algorithm (UCA). Some of those rules, and some not covered in the UCA, are:

  1. The default ordering / weight given in the allkeys.txt file (noted below)
  2. Which sensitivities and options are being used (e.g. is it case-sensitive or insensitive? and, if sensitive, is it upper-case first or lower-case first?)
  3. Any locale-based overrides.
  4. The version of the Unicode standard being used.
  5. The "human" factor (i.e. Unicode is a specification, not software, and is thus left up to each vendor to implement it)

I emphasized that final point regarding the human factor to hopefully make it clear that one should not expect SQL Server to always behave 100% according to the specification.

The overriding factor here is the weighting given to each Code Point, and the fact that multiple Code Points can share the same weight specification. You can find the basic weights (no locale-specific overrides) here (I believe the 100 series of Collations is Unicode v 5.0 -- informal confirmation in the comments on the Microsoft Connect item):

http://www.unicode.org/Public/UCA/5.0.0/allkeys.txt

The Code Point in question – U+FFFD – is defined as:

FFFD  ; [*0F12.0020.0002.FFFD] # REPLACEMENT CHARACTER

That notation is defined in section 9.1 Allkeys File Format of the UCA:

<entry>       := <charList> ';' <collElement>+ <eol>
<charList>    := <char>+
<collElement> := "[" <alt> <weight> "." <weight> "." <weight> ("." <weight>)? "]"
<alt>         := "*" | "."

Collation elements marked with a "*" are variable.

That last line is important as the Code Point we are looking at has a specification that indeed starts with "*". In section 3.6 Variable Weighting there are four possible behaviors defined, based on Collation configuration values that we have no direct access to (these are hard-coded into the Microsoft implementation of each Collation, such as whether case-sensitive uses lower-case first or upper-case first, a property which is different between VARCHAR data using SQL_ Collations and all other variations).

I do not have time to do the full research into which paths are taken and to infer which options are being used such that a more solid proof can be given, but it is safe to say that, within each Code Point specification, whether or not something is considered "equal" will not always use the full specification. In this case, we have "0F12.0020.0002.FFFD" and most likely it is just levels 2 and 3 that are being used (i.e. .0020.0002.). Doing a "Count" in Notepad++ for ".0020.0002." finds 12,581 matches (including supplementary characters that we have not been dealing with yet). Doing a "Count" on "[*" returns 4,049 matches. Doing a RegEx "Find" / "Count" using a pattern of \[\*\d{4}\.0020\.0002 returns 832 matches. So somewhere in this combination, plus possibly some other rules I am not seeing, plus some Microsoft-specific implementation details, is the full explanation of this behavior. And to be clear, the behavior is the same for all of the matching characters: they all match each other because they all have the same weight once the rules are applied (meaning, this question could have been asked about any one of them, not necessarily Mr. �).

You can see how the various sensitivities work across the two versions of Collations by running the query below and changing the COLLATE clause; the counts of matching characters for each collation follow the query:

;WITH cte AS
(
  SELECT     TOP (65536) ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1 AS [Num]
  FROM       [master].sys.columns col
  CROSS JOIN [master].sys.objects obj
)
SELECT cte.Num AS [Decimal],
       CONVERT(VARBINARY(2), cte.Num) AS [Hex],
       NCHAR(cte.Num) AS [Character]
FROM   cte
WHERE  NCHAR(cte.Num) = NCHAR(0xFFFD) COLLATE Latin1_General_100_CS_AS_WS --N'�'
ORDER BY cte.Num;

The counts of matching characters under the various collations are below.

Latin1_General_100_CS_AS_WS   =  5,840
Latin1_General_100_CS_AS      =  5,841 (the "extra" character is U+3000)
Latin1_General_100_CI_AS      =  5,841
Latin1_General_100_CI_AI      =  6,311

Latin1_General_CS_AS_WS       = 21,229
Latin1_General_CS_AS          = 21,230
Latin1_General_CI_AS          = 21,230
Latin1_General_CI_AI          = 21,537

In all of the collations listed above N'' = N'�' also evaluates to true.
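That can be spot-checked in a single query (a sketch; each column is expected to return 1):

SELECT CASE WHEN N'' = NCHAR(0xFFFD) COLLATE Latin1_General_100_CS_AS_WS THEN 1 ELSE 0 END AS [100_CS_AS_WS],
       CASE WHEN N'' = NCHAR(0xFFFD) COLLATE Latin1_General_100_CI_AI    THEN 1 ELSE 0 END AS [100_CI_AI],
       CASE WHEN N'' = NCHAR(0xFFFD) COLLATE Latin1_General_CS_AS_WS     THEN 1 ELSE 0 END AS [CS_AS_WS],
       CASE WHEN N'' = NCHAR(0xFFFD) COLLATE Latin1_General_CI_AI        THEN 1 ELSE 0 END AS [CI_AI];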

UPDATE

I was able to do a little more research and here is what I found:

How it "probably" should work

Using the ICU Collation Demo, I set the locale to "en-US-u-va-posix", set the strength to "primary", checked show "sort keys", and pasted in the following 4 characters that I copied from the results of the query above (using the Latin1_General_100_CI_AI Collation):

�
Ԩ
ԩ
Ԫ

and that returns:

Ԫ
    60 2E 02 .
Ԩ
    60 7A .
ԩ
    60 7A .
�
    FF FD .

Then, check the character properties for "�" at http://unicode.org/cldr/utility/character.jsp?a=fffd and see that the level 1 sort key (i.e. FF FD) matches the "uca" property. Clicking on that "uca" property takes you to a search page – http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auca%3DFFFD%3A%5D – showing just 1 match. And, in the allkeys.txt file, the level 1 sort weight is shown as 0F12, and there is only 1 match for that.

To make sure that we are interpreting the behavior correctly, I looked at another character: GREEK CAPITAL LETTER OMICRON WITH VARIA at http://unicode.org/cldr/utility/character.jsp?a=1FF8 which has a "uca" (i.e. level 1 sort weight / collating element) of 5F30. Clicking on that "5F30" takes us to a search page – http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auca%3D5F30%3A%5D – showing 30 matches, 20 of them being in the 0 - 65535 range (i.e. U+0000 - U+FFFF). Looking in the allkeys.txt file for Code Point 1FF8, we see a level 1 sort weight of 12E0. Doing a "Count" in Notepad++ on "12E0." shows 30 matches (this matches the results from Unicode.org, though it is not guaranteed to since the file is for Unicode v 5.0 and the site is using Unicode v 9.0 data).

In SQL Server, the following query returns 20 matches, same as the Unicode.org search when removing the 10 supplementary characters:

;WITH cte AS
(
  SELECT TOP (65535) ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS [Num]
  FROM   [master].sys.columns col
  CROSS JOIN [master].sys.objects obj
)
SELECT cte.Num AS [Decimal],
       CONVERT(VARCHAR(50), CONVERT(VARBINARY(2), cte.Num), 2) AS [Hex],
       NCHAR(cte.Num) AS [Character]
FROM cte
WHERE NCHAR(cte.Num) = NCHAR(0x1FF8) COLLATE Latin1_General_100_CI_AI
ORDER BY cte.Num;

And, just to be sure, going back to the ICU Collation Demo page, and replacing the characters in the "Input" box with the following 3 characters taken from the list of 20 results from SQL Server:

Ὂ
?
Ὸ

shows that they, indeed, all have the same 5F 30 level 1 sort weight (matching the "uca" field on the character property page).

SO, it certainly does seem as though this particular character should not match anything else.

How it actually works (at least in Microsoft-land)

Unlike SQL Server, .NET has a means of showing the sort key for a string, via the CompareInfo.GetSortKey method. Using this method and passing in just the U+FFFD character returns a sort key of 0x0101010100. Then, iterating over all of the characters in the range 0 - 65535 to see which ones had a sort key of 0x0101010100 returned 4,529 matches. This does not exactly match the 5,840 returned in SQL Server (when using the Latin1_General_100_CS_AS_WS Collation), but it is the closest we can get (for now), given that I am running Windows 10 and .NET Framework version 4.6.1, which uses Unicode v 6.3.0 according to the chart for the CharUnicodeInfo class (in "Note to Callers", in the "Remarks" section). For the moment I am using a SQLCLR function and so cannot change the target Framework version. When I get a chance I will create a console app and use a target Framework version of 4.5, as that uses Unicode v 5.0, which should match the 100-series Collations.

What this test shows is that, even without the exact same number of matches between .NET and SQL Server for U+FFFD, it is pretty clear that this is not SQL Server-specific behavior, and that, whether intentional or an oversight in Microsoft's implementation, the U+FFFD character does indeed match quite a few characters, even if it shouldn't according to the Unicode specification. And, given that this character matches U+0000 (null), it is probably just an issue of missing weights.
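The U+0000 observation can be checked directly in T-SQL (a sketch, using one of the 100-series collations from above; 'equal' is the expected result if neither character carries a weight):

SELECT CASE WHEN NCHAR(0xFFFD) = NCHAR(0) COLLATE Latin1_General_100_CS_AS
            THEN 'equal' ELSE 'not equal' END AS [FFFD vs 0000];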

ALSO

Regarding the difference in behaviour between the = query and the LIKE N'%�%' query, it comes down to the wildcards and the missing (I assume) weights for these characters (i.e. � Ƕ Ƿ Ǹ). If the LIKE condition is changed to simply LIKE N'�' then it returns the same 3 rows as the = condition. If the issue with the wildcards is not due to "missing" weights (there is no 0x00 sort key returned by CompareInfo.GetSortKey, btw), then it could be due to these characters potentially having a property that allows the sort key to vary based on context (i.e. surrounding characters).
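To illustrate that last point against the @T table variable from the original question (same batch, so that @T is in scope):

-- No wildcards: U+FFFD now behaves just like the = comparison
-- and returns the same 3 rows (Ƕ, Ƿ, Ǹ)
SELECT *
FROM   @T
WHERE  Col LIKE N'�';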