Well, the problem is well known, but I'm looking for a smarter solution if there's one.
For some reason the system doesn't recognize some characters and I can't compare the columns
Here is an example of the text:
Right
ASPIRADOR ULTRASSONICO-LOCAÇAO (NOTA FISCAL SERVIÇO)
Wrong
ASPIRADOR ULTRASSONICO-LOCA€AO (NOTA FISCAL SERVI€O)
Actually I'm fixing this through this function
create function fixcollation(@ps_Texto VARCHAR(4000)) returns VARCHAR(4000)
as
begin
declare @vlgsv1itu INT declare @nxn68ezzi INT declare @dw17rsyva VARCHAR(50) declare @iw8a2z01i VARCHAR(50) declare @t64e98xq6 VARCHAR(50) declare @zwjs2imy3 INT declare @jsyt85sy8 VARCHAR(4000)
----------------------------------------------------
set @dw17rsyva = ' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå♣—–éëꚇ€§'
set @iw8a2z01i = 'áàãâäÁÀÃÂÄéèêëÈÉÊËíìïÍÌÏóòõôöÓÒÕÔÖúùûüÚÙÛÜçǺØ'
set @jsyt85sy8 = @ps_Texto set @zwjs2imy3 = IsNull(datalength(@ps_Texto), 0)
set @nxn68ezzi = 1
while(@nxn68ezzi <= IsNull(datalength( @ps_Texto), 0))
begin
set @vlgsv1itu = 1
while(@vlgsv1itu <= IsNull(datalength(@dw17rsyva), 0))
begin
IF(ASCII(SUBSTRING(@ps_Texto, @nxn68ezzi, 1) COLLATE LATIN1_GENERAL_CS_AS) = ASCII(SUBSTRING(@dw17rsyva, @vlgsv1itu, 1) COLLATE LATIN1_GENERAL_CS_AS))
BEGIN
set @t64e98xq6 = SUBSTRING( @iw8a2z01i, @vlgsv1itu, 1) set @jsyt85sy8 = SUBSTRING(@jsyt85sy8, 1, @nxn68ezzi -1) + @t64e98xq6 + SUBSTRING(@jsyt85sy8, @nxn68ezzi + 1, @zwjs2imy3 - @nxn68ezzi)
break
end
set @vlgsv1itu = @vlgsv1itu + 1
end
set @nxn68ezzi = @nxn68ezzi + 1
end
return @jsyt85sy8
end
So, my question is: is this the best way or have I missed something here?
EDIT
Just a complementary test
select dbo.fixcollation(' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå♣—–éëꚇ€§')
select dbo.FixCodePage850toCodePage1252(' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå♣—–éëꚇ€§')
And this is the result in my production environment
fixcollation
FixCodePage850toCodePage1252
My personal thanks to Solomon Rutzky
Best Answer
This is an incorrect encoding issue. The characters are coming in encoded as DOS Code Page 850 yet the target Code Page you are using (based on the
Latin1_General
Collations) is Windows Code Page 1252. For example, in DOS Code Page 850, theÇ
character has a value of 0x80 (or 128 in Decimal). However, that same value of 0x80 in Windows Code Page 1252 gives you€
. Likewise,Ã
in DOS Code Page 850 has a value of 0xC7 (or 199 in Decimal). However, that same value of 0xC7 in Windows Code Page 1252 gives youÇ
.The incorrect characters are incorrect due to being imported into SQL Server with the wrong encoding being specified for the source. This is not happening within SQL Server as that would be a code page conversion issue, in which case the same "character" would have its value translated for the same character in the target Code Page (if the character exists in the target Code Page, else you get
?
). For example:Returns:
Meaning, this is happening most likely during a file import — BCP.exe, SQLCMD.exe,
BULK INSERT
,OPENROWSET(BULK...)
, custom app code that reads a file, etc — where either the wrong source Code Page is being specified, or no Code Page at all is being specified for the source. If an import is being done that specifies Code Page 1252 for this file, it will have the effect that you are seeing here since those bytes are encoded for Code Page 850, not Code Page 1252.It should be noted that this could also happen with data coming in from app code if the driver (ODBC, etc) is being told to use the wrong code page.
Now, regarding the method of fixing this:
SUBSTRING
when loading the characters, in pairs, into a table variable would have allowed for a single loop using theREPLACE
function. And using theASCII
function and a case-sensitive, accent-sensitive Collation is unnecessary and prone to error (if two characters match what is being searched for) when using a_BIN2
Collation would have been better.Use the following function which does the conversion. First it gets the bytes of the current string, then it injects those bytes into a
VARCHAR
column that uses Code Page 850, then it selects that value from the table variable into a local variable (necessary anyway to return the value) which has the effect of converting the string into the Code Page used by the default Collation of the Database (which here would have to be Code Page 1252 else you would not be getting the "correct" string out of the function):Testing both functions returns the same results:
I came up with a test to check the mappings of all characters just in case the company providing the translation function missed any mappings. I filtered out the graphics characters and dotless "i" that are only found in Code Page 850.
That returns a list of 52 characters that could have come through the import process mistranslated like the others, but skipped by the UDF that you were provided by that other company that only handles 46 of the apparently 98 possible characters.