After doing a load of CSV files, there are various words that are "written" to the database incorrectly.
Some examples:
Diã¡ria
should beDiária
Crã©dito
should beCrédito
Ligaã§ãµes
should beLigações
Usuã¡rio
should beUsuário
Nãºmeros
should beNúmeros
Is there any way to translate the symbols to the correct characters?
I've done several tests with different collations
and functions
that could search the internet, but without success.
Best Answer
No, the characters are being read incorrectly. They are being written correctly. Or, another way to look at this is: the characters are being written incorrectly to the data file. Either way, SQL Server is doing exactly what it is being asked to do.
This is a simple encoding issue. The data was originally exported as UTF-8, but then that UTF-8 encoded file was read into SQL Server as if it were an Extended ASCII file using Code Page 1252. There are three clues which indicate that this is the issue:
UTF-8 is a multi-byte encoding: it uses a different number of bytes depending on the character being encoded. The first 128 Code Points (U+0000 - U+007F), which contain the US English alphabet, all use 1 byte. Code Points above that range take 2 - 4 bytes. This is why some characters transferred between systems as expected:
N
(upper-case Latin "N") is 0x4E in UTF-8 as well as in Code Page 1252 (in fact, it's 0x4E across all 8-bit Code Pages supported by SQL Server). This is one of the reasons that UTF-8 is so popular. However, the accented characters are not 1 byte in UTF-8:The
á
( U+00E1 ) is encoded as two bytes in UTF-8: 0xC3 and 0xA1. When those two bytes are read by something expecting Code Page 1252, they are interpreted as beingÃ
( 0xC3 on Code Page 1252 ) and¡
( 0xA1 on Code Page 1252 ). Then, either you or your import process, lower-cased theÃ
(probably because it was in the middle of a word) which is how you ended up with:Usuã¡rio
.The
ú
( U+00FA ) is encoded as two bytes in UTF-8: 0xC3 and 0xBA. When those two bytes are read by something expecting Code Page 1252, they are interpreted as beingÃ
( 0xC3 on Code Page 1252 ) andº
( 0xBA on Code Page 1252 ). Then, either you or your import process, lower-cased theÃ
(probably because it was in the middle of a word) which is how you ended up with:Nãºmero
.Your choices are:
Encode the file as (Extended) ASCII using Code Page 1252 (as it is exported), and do not change how it is being read into SQL Server. (it sounds like you have already done this).
Continue exporting the file using the UTF-8 encoding, but change how the file is being read into SQL Server by specifying that it is encoded as UTF-8. Please note that for anyone using BCP.exe,
BULK INSERT
, orOPENROWSET(BULK...)
, this option only became available starting in SQL Server 2016. The Code Page to use is65001
(which usually means UTF-8 with Byte Order Mark, but I am not sure if SQL Server requires the Byte Order Mark in these cases).