SQL Server – Collate Issues with Wrong Characters

character-setcollationencodingsql servert-sql

Well, the problem is well known, but I'm looking for a smarter solution if there's one.

For some reason the system doesn't recognize some characters and I can't compare the columns

Here is an example of the text:

Right

ASPIRADOR ULTRASSONICO-LOCAÇAO (NOTA FISCAL SERVIÇO)

Wrong

ASPIRADOR ULTRASSONICO-LOCA€AO (NOTA FISCAL SERVI€O)

Actually I'm fixing this through this function

create function fixcollation(@ps_Texto VARCHAR(4000)) returns VARCHAR(4000) 

as 

begin  

    declare @vlgsv1itu INT declare @nxn68ezzi INT declare @dw17rsyva  VARCHAR(50) declare @iw8a2z01i VARCHAR(50) declare @t64e98xq6 VARCHAR(50) declare @zwjs2imy3 INT declare @jsyt85sy8 VARCHAR(4000)  

    ---------------------------------------------------- 

    set @dw17rsyva = ' …Æƒ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãåâ™£—–éëêš‡€§' 
    set @iw8a2z01i = 'áàãâäÁÀÃÂÄéèêëÈÉÊËíìïÍÌÏóòõôöÓÒÕÔÖúùûüÚÙÛÜçÇºØ' 
    set @jsyt85sy8 = @ps_Texto set @zwjs2imy3 = IsNull(datalength(@ps_Texto), 0) 
    set @nxn68ezzi = 1 
    while(@nxn68ezzi <= IsNull(datalength( @ps_Texto), 0)) 

    begin 

        set @vlgsv1itu = 1 

        while(@vlgsv1itu <= IsNull(datalength(@dw17rsyva), 0)) 
        begin 

            IF(ASCII(SUBSTRING(@ps_Texto, @nxn68ezzi, 1) COLLATE LATIN1_GENERAL_CS_AS) = ASCII(SUBSTRING(@dw17rsyva, @vlgsv1itu, 1) COLLATE LATIN1_GENERAL_CS_AS)) 
            BEGIN 
                set @t64e98xq6 = SUBSTRING( @iw8a2z01i, @vlgsv1itu, 1) set @jsyt85sy8 = SUBSTRING(@jsyt85sy8, 1, @nxn68ezzi -1) + @t64e98xq6 + SUBSTRING(@jsyt85sy8, @nxn68ezzi + 1, @zwjs2imy3 -  @nxn68ezzi) 
                break 
            end 
            set @vlgsv1itu = @vlgsv1itu + 1 
        end 
        set @nxn68ezzi = @nxn68ezzi + 1 
    end 
    return @jsyt85sy8 
end

So, my question is: is this the best way or have I missed something here?

EDIT

Just a complementary test

select dbo.fixcollation(' …Æƒ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãåâ™£—–éëêš‡€§')
select dbo.FixCodePage850toCodePage1252(' …Æƒ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãåâ™£—–éëêš‡€§')

And this is the result in my production environment

fixcollation

FixCodePage850toCodePage1252

My personal thanks to Solomon Rutzky

Best Answer

This is an incorrect encoding issue. The characters are coming in encoded as DOS Code Page 850 yet the target Code Page you are using (based on the Latin1_General Collations) is Windows Code Page 1252. For example, in DOS Code Page 850, the Ç character has a value of 0x80 (or 128 in Decimal). However, that same value of 0x80 in Windows Code Page 1252 gives you €. Likewise, Ã in DOS Code Page 850 has a value of 0xC7 (or 199 in Decimal). However, that same value of 0xC7 in Windows Code Page 1252 gives you Ç.

The incorrect characters are incorrect due to being imported into SQL Server with the wrong encoding being specified for the source. This is not happening within SQL Server as that would be a code page conversion issue, in which case the same "character" would have its value translated for the same character in the target Code Page (if the character exists in the target Code Page, else you get ?). For example:

SELECT ASCII('Ç' COLLATE Latin1_General_CI_AS) AS [CP1252 Value],
       'Ç' COLLATE SQL_Latin1_General_CP850_CI_AS AS [CharacterInCP850],
       ASCII('Ç' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [CP850 Value];

Returns:

CP1252 Value     CharacterInCP850     CP850 Value
199              Ç                    128

Meaning, this is happening most likely during a file import — BCP.exe, SQLCMD.exe, BULK INSERT, OPENROWSET(BULK...), custom app code that reads a file, etc — where either the wrong source Code Page is being specified, or no Code Page at all is being specified for the source. If an import is being done that specifies Code Page 1252 for this file, it will have the effect that you are seeing here since those bytes are encoded for Code Page 850, not Code Page 1252.

It should be noted that this could also happen with data coming in from app code if the driver (ODBC, etc) is being told to use the wrong code page.

Now, regarding the method of fixing this:

Ideally the method of importing the data would be updated / fixed to properly account for the actual code page with which the data has been encoded.
If it is not possible to fix the import process, then the function that the other company provided is not the best way to go. In fact, it is probably the slowest, most convoluted approach which is also prone to errors (if they didn't map all of the characters). There is no reason to do two loops along with SUBSTRING when loading the characters, in pairs, into a table variable would have allowed for a single loop using the REPLACE function. And using the ASCII function and a case-sensitive, accent-sensitive Collation is unnecessary and prone to error (if two characters match what is being searched for) when using a _BIN2 Collation would have been better.

Use the following function which does the conversion. First it gets the bytes of the current string, then it injects those bytes into a VARCHAR column that uses Code Page 850, then it selects that value from the table variable into a local variable (necessary anyway to return the value) which has the effect of converting the string into the Code Page used by the default Collation of the Database (which here would have to be Code Page 1252 else you would not be getting the "correct" string out of the function):

USE [tempdb];
GO
CREATE FUNCTION dbo.FixCodePage850toCodePage1252
(
    @CodePage850String VARCHAR(8000)
)
RETURNS VARCHAR(8000)
WITH SCHEMABINDING
AS
BEGIN
  DECLARE @Convert850to1252 TABLE
  (
    [String] VARCHAR(8000) COLLATE SQL_Latin1_General_CP850_CI_AS
  );
  DECLARE @ReturnValue VARCHAR(8000);

  INSERT INTO @Convert850to1252 ([String])
  VALUES (CONVERT(VARBINARY(8000), @CodePage850String, 0));

  SELECT @ReturnValue = [String] -- automatic conversion to Code Page of database
  FROM @Convert850to1252;

  RETURN @ReturnValue;
END;
GO

Testing both functions returns the same results:

SELECT dbo.fixcollation('Ç‰§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)

SELECT dbo.FixCodePage850toCodePage1252('Ç‰§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)

I came up with a test to check the mappings of all characters just in case the company providing the translation function missed any mappings. I filtered out the graphics characters and dotless "i" that are only found in Code Page 850.

USE [tempdb];
GO
;WITH nums AS
(
    SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [num]
    FROM   master.sys.columns
), vals AS
(
  SELECT nums.[num] AS [Value],
         CHAR(nums.[num]) AS [Character],
         dbo.fixcollation(CHAR(nums.[num])) AS [OldWay],
         dbo.FixCodePage850toCodePage1252(CHAR(nums.[num])) AS [NewWay]
  FROM   nums
)
SELECT vals.*,
       ASCII(vals.[NewWay]) AS [NewValue]
FROM   vals
WHERE  vals.[Character] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND    vals.[OldWay] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND    vals.[Value] NOT IN (176, 177, 178, 179, 180, 185, 186, 187, 188,
                            191, 192, 193, 194, 195, 196, 197, 200, 201,
                            202, 203, 204, 205, 206, 217, 218, 219, 220,
                            223, 254, 213); -- characters only in CP850

That returns a list of 52 characters that could have come through the import process mistranslated like the others, but skipped by the UDF that you were provided by that other company that only handles 46 of the apparently 98 possible characters.

Related Solutions

Db2 – Dutch characters in DB2 problem

Well, the trick is that a database can only specify which "locale" it is used for at creation time. When you create a database you either specify what you want by specifying the codeset, territory and collation (example CREATE DATABASE MYDB AUTOMATIC STORAGE YES ON '/data' DBPATH ON '/dbdir' USING CODESET UTF-8 TERRITORY US COLLATE USING SYSTEM), or you let DB2 guess based on some system settings.

You can check what yours is set to by running db2 get db cfg for <db name>. In the case of what was used above you would see the following:

Database territory                                      = US
Database code page                                      = 1208
Database code set                                       = UTF-8
Database country/region code                            = 1
Database collating sequence                             = IDENTITY
Alternate collating sequence              (ALT_COLLATE) =

Note, that you cannot change these settings after a database is created. One option would be to export the data from your database (using db2export), drop and rebuild the database the with language settings you want, and then re-import the data (using db2import).

Another option would be to see if there is a way to represent your character strings as pure "bit" data and store the bit sequence rather than the "string" or declare a field as VARCHAR FOR BIT DATA and store your strings as bit sequences (though I'm not 100% positive on that one...).

Also check out the links below as they have more information on the language thing and how to deal with multi-code-page results. Note that some of this information requires DB2 to be at least at version 9.5 or higher.

as well as check out the whole Multicultural support parent/sibling links related to that list link (the Character conversion between different code pages).

SQL Server Collation – Using COLLATE: When to Use and When Not to Use

If you are going to use custom collations for specific databases then yes, you'll need to make the collations match whenever you are joining or unioning data from the two databases.

In fact you will need to do this with many metadata queries anyway. Just look at catalog views like sys.tables:

SELECT c.name, c.collation_name
FROM sys.all_columns AS c
INNER JOIN sys.all_views AS v
ON c.[object_id] = v.[object_id]
INNER JOIN sys.schemas AS s
ON v.[schema_id] = s.[schema_id]
WHERE s.name = N'sys' AND v.name = N'tables'
AND c.collation_name IS NOT NULL;

Results:

name                    SQL_Latin1_General_CP1_CI_AS
type                    Latin1_General_CI_AS_KS_WS
type_desc               Latin1_General_CI_AS_KS_WS
lock_escalation_desc    Latin1_General_CI_AS_KS_WS
durability_desc         Latin1_General_CI_AS_KS_WS

So here we have columns with two different collations inside a single system object. No database can have a DATABASE_DEFAULT that matches both...

You don't have any control over the collation of your customer's columns or databases, or the server collation. So really the only ways to resolve the conflict are to:

use a method that doesn't hard-code a specific collation (like DATABASE_DEFAULT)
hard-code some specific, compatible collation on both sides

Since the latter is more work, makes for more complex queries, and introduces more opportunities to turn seeks into scans, I think the former is really your best option.

Best Answer

Related Solutions

Db2 – Dutch characters in DB2 problem

SQL Server Collation – Using COLLATE: When to Use and When Not to Use

Related Question