Sql-server – Why do two identical strings have a different length but the same binary value

collationdistinctsql serversql-server-2016

I am trying to return a distinct set of department names from a table – nothing special. However, when using the following query duplicates are displayed:

  select distinct department_name
    from dbo.departments;

I also tried:

  select distinct department_name
    from dbo.departments
group by department_name;

So this led me to believe that I might have hidden characters in the values and sure enough, when I checked the length of the strings they returned different values. So, I decided to locate the hidden characters using the function from this question on stack overflow. Strangely enough this only return SPACE. I then tried the following query and it made no difference at all:

select distinct ltrim(rtrim(department_name)) as department_name
  from dbo.departments;

Intrigued, I cast the values to VARBINARY and noticed that they have the exact same binary value, and performing a DISTINCT on the binary value does indeed produce a unique result set.

I have tried casting between VARCHAR and NVARCHAR and to a different collation also (the values are in the same column, within the same database using Latin1_General_CI_AI). I really need to be able to get a distinct set from this table. Does anybody know what might be causing this issue?

UPDATE

After some further investigation, this issue seems to only happen for strings that end with the hex value 0xA000. Any value in the column not ending with this character are fine.

UPDATE 2

If I remove the 0xA000 character from the string I can then apply DISTINCT as normal like so:

DECLARE @binary VARBINARY(8) = 0xA000;
DECLARE @string VARCHAR(8) = CONVERT(VARCHAR(MAX), @binary);
 UPDATE dbo.departments
    SET department_name = REPLACE(department_name, @string, '');

But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the WHERE clause. I am using a workaround right now which is no more than using MIN to return the entry with the shortest length. This is less than ideal as the problem with distinct is also affecting most of the other language elements as well, for example GROUP BY, ORDER BY, any window functions and COUNT.

Best Answer

Good job investigating so far. Some initial notes:

  1. I wouldn't worry about that function from the S.O. answer.

  2. RTRIM and LTRIM only trim spaces, not white-space in general:

    SELECT RTRIM('A    ') + 'a';
    -- Aa
    
    SELECT RTRIM('A    ' + CHAR(9)) + 'a'; -- CHAR(9) = tab
    -- A        a
    
  3. Adding GROUP BY (2nd query) doesn't change that query since it was implied in the first query ;-).

  4. Whether 0xA000 is 2 VARCHAR characters or 1 NVARCHAR character, there does not seem to be any special behavior with this sequence of bytes for either datatype, using either Latin1_General_CI_AI or Latin1_General_100_CI_AI.

Still, something is amiss. You simply cannot have different length (or different anything) with the same binary value. How was length determined: LEN or DATALENGTH? Is it possible that the value got truncated somewhere in the testing so that it then appeared to be the same?

In order to help further, we need to know two things (please update the question with the results):

  1. The exact datatype of the department_name. Please find via:

    SELECT typ.[name], col.*
    FROM   sys.columns col
    INNER JOIN sys.types typ
            ON typ.[user_type_id] = col.[user_type_id]
    WHERE  col.[object_id] = OBJECT_ID(N'dbo.departments')
    AND    col.[name] = N'department_name';
    
  2. The output of the following query:

    SELECT dept.department_name,
           LEN(dept.department_name) AS [name_chars],
           DATALENGTH(dept.department_name) AS [name_bytes],
           CONVERT(VARBINARY(MAX), dept.department_name) AS [name_hex]
    FROM   dbo.departments dept
    ORDER BY dept.department_name
    

    Just find a set of rows for a name that looks the same but is showing duplicates via DISTINCT.

For the moment, I can say that if you have character 0 (ASCII value 0, Code Point U+0000, CHAR(0), NCHAR(0)) in your values then appearances can be deceiving since char(0) is the "null terminator" for strings. So neither it nor anything after it will be displayed, but it and everything after are still part of the string:

DECLARE @Test TABLE
(
  [Something] VARCHAR(50) COLLATE Latin1_General_CI_AI
);

INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');

INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');

INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'don''t');

SELECT DISTINCT [Something],
       LEN([Something]) AS [Something_chars],
       DATALENGTH([Something]) AS [Something_bytes]
FROM   @Test;
/*
Something        Something_chars    Something_bytes
Sometimes you    19                 19
Sometimes you    29                 29
*/

Enjoy: Peter Paul Almond Joy & Mounds - "Feel Like A Nut" (1980)


TEMPORARY UPDATE (WAITING FOR QUESTION TO BE UPDATED WITH QUERY OUTPUT):

Based on this info from a comment:

I used both LEN and DATALENGTH to perform the check. For two exact same strings I got (20,40) and (21,42) respectively.

it is clear that:

  1. these are definitely different values, and
  2. the datatype of the column is NVARCHAR (since DATALENGTH is twice LEN)

That means that the 0xA000 value is a single, UTF16LE character. Due to being Little Endian (bytes in reverse order), the actual Code Point is U+00A0. That character is:

No-Break Space

As in our favorite HTML character:  

All you need to do is remove those characters on the way into the DB, using:

REPLACE(@InputParam, NCHAR(0x00A0) COLLATE Latin1_General_100_BIN2, N'')

For example:

SELECT CONVERT(VARBINARY(MAX),
      REPLACE(N'test' + NCHAR(0x00A0), NCHAR(0x00A0) COLLATE Latin1_General_100_BIN2, N'')
    );

To be clear, all of that work messing with VARBINARY and VARCHAR, etc is unnecessary.

But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the WHERE clause.

True, updating every WHERE clause is not a workable solution. This is why you need to sanitize the input on the way in. There are a limited number of entry points for the data (INSERT / UPDATE procs for the UI, possibly some ETL processes), so it shouldn't be that bad. You can request that the developers strip out "bad" characters before they call the stored procedure(s), but there is no guarantee that they will, or that new code will, or that things won't change later, or that they will be able to fix ETL processes, etc.