Sql-server – Why do two identical strings have a different length but the same binary value

collationdistinctsql serversql-server-2016

I am trying to return a distinct set of department names from a table – nothing special. However, when using the following query duplicates are displayed:

  select distinct department_name
    from dbo.departments;

I also tried:

  select distinct department_name
    from dbo.departments
group by department_name;

So this led me to believe that I might have hidden characters in the values and sure enough, when I checked the length of the strings they returned different values. So, I decided to locate the hidden characters using the function from this question on stack overflow. Strangely enough this only return SPACE. I then tried the following query and it made no difference at all:

select distinct ltrim(rtrim(department_name)) as department_name
  from dbo.departments;

Intrigued, I cast the values to VARBINARY and noticed that they have the exact same binary value, and performing a DISTINCT on the binary value does indeed produce a unique result set.

I have tried casting between VARCHAR and NVARCHAR and to a different collation also (the values are in the same column, within the same database using Latin1_General_CI_AI). I really need to be able to get a distinct set from this table. Does anybody know what might be causing this issue?

UPDATE

After some further investigation, this issue seems to only happen for strings that end with the hex value 0xA000. Any value in the column not ending with this character are fine.

UPDATE 2

If I remove the 0xA000 character from the string I can then apply DISTINCT as normal like so:

DECLARE @binary VARBINARY(8) = 0xA000;
DECLARE @string VARCHAR(8) = CONVERT(VARCHAR(MAX), @binary);
 UPDATE dbo.departments
    SET department_name = REPLACE(department_name, @string, '');

But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the WHERE clause. I am using a workaround right now which is no more than using MIN to return the entry with the shortest length. This is less than ideal as the problem with distinct is also affecting most of the other language elements as well, for example GROUP BY, ORDER BY, any window functions and COUNT.

Best Answer

Good job investigating so far. Some initial notes:

I wouldn't worry about that function from the S.O. answer.

RTRIM and LTRIM only trim spaces, not white-space in general:

SELECT RTRIM('A    ') + 'a';
-- Aa

SELECT RTRIM('A    ' + CHAR(9)) + 'a'; -- CHAR(9) = tab
-- A        a

Adding GROUP BY (2nd query) doesn't change that query since it was implied in the first query ;-).
Whether 0xA000 is 2 VARCHAR characters or 1 NVARCHAR character, there does not seem to be any special behavior with this sequence of bytes for either datatype, using either Latin1_General_CI_AI or Latin1_General_100_CI_AI.

Still, something is amiss. You simply cannot have different length (or different anything) with the same binary value. How was length determined: LEN or DATALENGTH? Is it possible that the value got truncated somewhere in the testing so that it then appeared to be the same?

In order to help further, we need to know two things (please update the question with the results):

The exact datatype of the department_name. Please find via:

SELECT typ.[name], col.*
FROM   sys.columns col
INNER JOIN sys.types typ
        ON typ.[user_type_id] = col.[user_type_id]
WHERE  col.[object_id] = OBJECT_ID(N'dbo.departments')
AND    col.[name] = N'department_name';

The output of the following query:

SELECT dept.department_name,
       LEN(dept.department_name) AS [name_chars],
       DATALENGTH(dept.department_name) AS [name_bytes],
       CONVERT(VARBINARY(MAX), dept.department_name) AS [name_hex]
FROM   dbo.departments dept
ORDER BY dept.department_name

Just find a set of rows for a name that looks the same but is showing duplicates via DISTINCT.

For the moment, I can say that if you have character 0 (ASCII value 0, Code Point U+0000, CHAR(0), NCHAR(0)) in your values then appearances can be deceiving since char(0) is the "null terminator" for strings. So neither it nor anything after it will be displayed, but it and everything after are still part of the string:

DECLARE @Test TABLE
(
  [Something] VARCHAR(50) COLLATE Latin1_General_CI_AI
);

INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');

INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');

INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'don''t');

SELECT DISTINCT [Something],
       LEN([Something]) AS [Something_chars],
       DATALENGTH([Something]) AS [Something_bytes]
FROM   @Test;
/*
Something        Something_chars    Something_bytes
Sometimes you    19                 19
Sometimes you    29                 29
*/

Enjoy: Peter Paul Almond Joy & Mounds - "Feel Like A Nut" (1980)

TEMPORARY UPDATE (WAITING FOR QUESTION TO BE UPDATED WITH QUERY OUTPUT):

Based on this info from a comment:

I used both LEN and DATALENGTH to perform the check. For two exact same strings I got (20,40) and (21,42) respectively.

it is clear that:

these are definitely different values, and
the datatype of the column is NVARCHAR (since DATALENGTH is twice LEN)

That means that the 0xA000 value is a single, UTF16LE character. Due to being Little Endian (bytes in reverse order), the actual Code Point is U+00A0. That character is:

No-Break Space

As in our favorite HTML character:  

All you need to do is remove those characters on the way into the DB, using:

REPLACE(@InputParam, NCHAR(0x00A0) COLLATE Latin1_General_100_BIN2, N'')

For example:

SELECT CONVERT(VARBINARY(MAX),
      REPLACE(N'test' + NCHAR(0x00A0), NCHAR(0x00A0) COLLATE Latin1_General_100_BIN2, N'')
    );

To be clear, all of that work messing with VARBINARY and VARCHAR, etc is unnecessary.

But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the WHERE clause.

True, updating every WHERE clause is not a workable solution. This is why you need to sanitize the input on the way in. There are a limited number of entry points for the data (INSERT / UPDATE procs for the UI, possibly some ETL processes), so it shouldn't be that bad. You can request that the developers strip out "bad" characters before they call the stored procedure(s), but there is no guarantee that they will, or that new code will, or that things won't change later, or that they will be able to fix ETL processes, etc.

Related Solutions

Sql-server – Why are non-digits LIKE [0-9]

[0-9] is not some type of regular expression defined to just match digits.

Any range in a LIKE pattern matches characters between the start and end character according to collation sort order.

SELECT CodePoint,
       Symbol,
       RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM   #CodePage
WHERE  Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER  BY Symbol COLLATE Latin1_General_CI_AS

Returns

CodePoint            Symbol Rnk
-------------------- ------ --------------------
48                   0      1
188                  ¼      2
189                  ½      3
190                  ¾      4
185                  ¹      5
49                   1      5
50                   2      7
178                  ²      7
179                  ³      9
51                   3      9
52                   4      11
53                   5      12
54                   6      13
55                   7      14
56                   8      15
57                   9      16

So you get these results because under your default collation these characters sort after 0 but before 9.

It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between 0 and 1.

You could also use a set rather than a range. To avoid 2 matching ² you would need a CS collation

SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS

Mysql – (non-binary) MySQL collation that doesn’t treat different mathematical symbols as the same character

From 10.1.14.1 Unicode Character Sets in the MySQL 5.5 Reference Manual (emphasis added):

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA collations, their collating weight is 0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.

and:

The current rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters.

So the answer to your question appears to be "no".

finally:

If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:

ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin

Though it is not clear to me how this should be applied for comparisons.

The documentation extracts quoted above are unchanged for MySQL 5.7.

Best Answer

Related Solutions

Sql-server – Why are non-digits LIKE [0-9]

Mysql – (non-binary) MySQL collation that doesn’t treat different mathematical symbols as the same character

Related Question