Sql-server – Why does MS SQL Server return a result for empty string check when Unicode string is not empty

collationencodingsql serverunicode

select * from (select N'ግዜ ' as t) as t2 where t= ''

the string 'ግዜ ' matches the above check, why is this?

Best Answer

While I am not sure of the exact reason for those specific characters, t The issue has to do with the older collations (please see UPDATE section at the end). And it is not just empty string that they equate to, but also to just one of those characters:

SELECT * FROM (SELECT N'ግዜ') tab(col) WHERE tab.col = N'ግ';

And if you try a case-sensitive collation, even with multiple characters, they still equate:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግግግግ' COLLATE SQL_Latin1_General_CP1_CS_AS;
SELECT * FROM (SELECT N'ግ') t(c) WHERE t.c = N'ዜዜዜዜ' COLLATE SQL_Latin1_General_CP1_CS_AS;
SELECT * FROM (SELECT N'ዜ') t(c) WHERE t.c = N'ግግግግ' COLLATE SQL_Latin1_General_CP1_CS_AS;

Even the "equivalent" Windows Collations have the same issue:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_CS_AS;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_CS_AS_KS_WS;

BUT, it seems that newer versions of the Windows Collations (i.e. the 100 series or newer) "fixes" the issue and these no longer equate:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_CI_AI;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_CI_AS;

And, of course, the binary Windows Collations (both older and newer series) work just fine as the following do not report a match:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_BIN;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_BIN2;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_BIN;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_BIN2;

UPDATE (2015-08-20)
After 6 hours of pouring through documentation on http://www.unicode.org/, http://site.icu-project.org/, and a couple of other Unicode-related sites, I gave up trying to find evidence of a "weighting" change that might have occurred just prior to 2008 (the new 100 series of collations were introduced in SQL Server 2008). I did, however, find the following info at www.fileformat.info for the two characters being tested here:

So, I moved on to the next project and moments later came across the following on the SQL Server 2008 MSDN page for Collation and Unicode Support:

SQL Server 2008 has introduced new collations that are in full alignment with collations that Windows Server 2008 provides. These 80 new collations are denoted by *_100 version references. They provide users with the most up-to-date and linguistically accurate cultural sorting conventions. Support includes the following:

  • ...
  • Weighting has been added to previously non-weighted characters that would have compared equally.

No sort weight for a character means that it is effectively invisible.

Moral of the story: don't try so hard; give up sooner ;-)

UPDATE (2018-09-20)

For a more visual indication of what is going on, the query below compares each BMP character (Code Points 0 - 65535 / U+0000 - U+FFFF) to an empty string. The comparison is repeated using different collations: BIN2, a SQL Server Collations, Latin1_General that started with SQL Server 2000, Latin1_General that started with SQL Server 2008, Japanese_XJIS that started with SQL Server 2008, and Japanese_XJIS that started with SQL Server 2017. The two Collations starting in SQL Server 2008 are showing that both return the same number of matches, yet the newer Japanese_XJIS Collation returns a different number (the only Collations updated in SQL Server 2017 are the Japanese Collations). This is done to show how many characters are missing sort weights across the various Collation versions.

;WITH nums AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [CodePoint]
  FROM   [master].[sys].[columns] col
  CROSS JOIN [master].[sys].[objects] obj
)
SELECT nums.[CodePoint],
       COALESCE(NCHAR(nums.[CodePoint]), N'TOTALS:') AS [Character],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Latin1_General_BIN2) THEN 1 END) AS [BIN2],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE SQL_Latin1_General_CP1_CS_AS) THEN 1 END) AS [SQL Collations],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Latin1_General_CS_AS_KS_WS) THEN 1 END) AS [SQL2000 Latin1],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Latin1_General_100_CS_AS_KS_WS) THEN 1 END) AS [SQL2008 Latin1],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Japanese_XJIS_100_CS_AS_KS_WS) THEN 1 END) AS [SQL2008 Japanese],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Japanese_XJIS_140_CS_AS_KS_WS) THEN 1 END) AS [SQL2017 Japanese]
FROM   nums
GROUP BY ROLLUP ((nums.[CodePoint], NCHAR(nums.[CodePoint])));

To see the details for all rows, execute the query above. But for just the summary, that is:

BIN2  SQL Collations SQL2000 Latin1  SQL2008 Latin1  SQL2008 Japanese  SQL2017 Japanese
1     21230          21229           5840            5840              3375