SQL Server – Character Conversion Issue Importing From CSV File

Tags: collation, encoding, etl, import, sql-server

After loading a batch of CSV files, various words are "written" to the database incorrectly.

Some examples:

  • Diã¡ria should be Diária
  • Crã©dito should be Crédito
  • Ligaã§ãµes should be Ligações
  • Usuã¡rio should be Usuário
  • Nãºmeros should be Números

Is there any way to translate the symbols to the correct characters?

I've run several tests with different collations and with functions I could find on the internet, but without success.

Best Answer

there are various words that are "written" to the database incorrectly.

No, the characters are being read incorrectly. They are being written correctly. Or, another way to look at this is: the characters are being written incorrectly to the data file. Either way, SQL Server is doing exactly what it is being asked to do.

This is a simple encoding issue. The data was originally exported as UTF-8, but then that UTF-8 encoded file was read into SQL Server as if it were an Extended ASCII file using Code Page 1252. There are three clues which indicate that this is the issue:

  1. Most characters come through correctly. The ones that don't are the accented characters.
  2. The characters that are incorrect are coming through as two characters instead of one.
  3. You mentioned that exporting to an "ASCII" file worked.

UTF-8 is a multi-byte encoding: it uses a different number of bytes depending on the character being encoded. The first 128 Code Points (U+0000 - U+007F), which contain the US English alphabet, all use 1 byte. Code Points above that range take 2 - 4 bytes. This is why some characters transferred between systems as expected: N (upper-case Latin "N") is 0x4E in UTF-8 as well as in Code Page 1252 (in fact, it's 0x4E across all 8-bit Code Pages supported by SQL Server). This is one of the reasons that UTF-8 is so popular. However, the accented characters are not 1 byte in UTF-8:

  • The á ( U+00E1 ) is encoded as two bytes in UTF-8: 0xC3 and 0xA1. When those two bytes are read by something expecting Code Page 1252, they are interpreted as being Ã ( 0xC3 on Code Page 1252 ) and ¡ ( 0xA1 on Code Page 1252 ). Then either you or your import process lower-cased the Ã (probably because it was in the middle of a word), which is how you ended up with: Usuã¡rio.

  • The ú ( U+00FA ) is encoded as two bytes in UTF-8: 0xC3 and 0xBA. When those two bytes are read by something expecting Code Page 1252, they are interpreted as being Ã ( 0xC3 on Code Page 1252 ) and º ( 0xBA on Code Page 1252 ). Then either you or your import process lower-cased the Ã (probably because it was in the middle of a word), which is how you ended up with: Nãºmeros.
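The two bullet points above can be reproduced directly: encode the correct text as UTF-8, then decode those same bytes as Code Page 1252, and you get exactly the garbled forms (a minimal sketch in Python, which exposes Code Page 1252 under the name "cp1252"):

```python
# Encode the correct string as UTF-8, then (mis)read those bytes as
# Code Page 1252 -- the same mistake the import process is making.
correct = "Usuário"
utf8_bytes = correct.encode("utf-8")    # 'á' becomes the two bytes 0xC3 0xA1
garbled = utf8_bytes.decode("cp1252")   # 0xC3 -> 'Ã', 0xA1 -> '¡'
print(garbled)                          # UsuÃ¡rio (before any lower-casing)

# The same mechanism for 'ú' (0xC3 0xBA -> 'Ã' + 'º'):
print("Números".encode("utf-8").decode("cp1252"))  # NÃºmeros
```

Note that the output still has the upper-case Ã; the lower-case ã in your data is the result of the extra lower-casing step described above.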

Your choices are:

  1. Export the file encoded as (Extended) ASCII using Code Page 1252, and do not change how it is being read into SQL Server. (It sounds like you have already done this.)

  2. Continue exporting the file using the UTF-8 encoding, but change how the file is being read into SQL Server by specifying that it is encoded as UTF-8. Please note that for anyone using BCP.exe, BULK INSERT, or OPENROWSET(BULK...), this option only became available starting in SQL Server 2016. The Code Page to use is 65001 (which usually means UTF-8 with Byte Order Mark, but I am not sure if SQL Server requires the Byte Order Mark in these cases).
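For data that has already been imported in its garbled form, the damage can also be reversed outside of SQL Server by running the mis-decoding backwards: re-encode the garbled string as Code Page 1252 to recover the original UTF-8 bytes, then decode those bytes as UTF-8. A minimal sketch in Python (the `unmangle` helper is hypothetical, and it assumes the only extra damage is the lower-casing of the 0xC3 lead byte Ã to ã, which it undoes first):

```python
def unmangle(garbled: str) -> str:
    """Reverse UTF-8-read-as-cp1252 mojibake (hypothetical helper).

    Assumes the only damage beyond the mis-decoding is that the
    lead byte 'Ã' (0xC3) was lower-cased to 'ã' along the way.
    """
    restored = garbled.replace("ã", "Ã")        # undo the lower-casing of the lead byte
    original_bytes = restored.encode("cp1252")  # recover the original UTF-8 bytes
    return original_bytes.decode("utf-8")       # read them correctly this time

print(unmangle("Diã¡ria"))     # Diária
print(unmangle("Usuã¡rio"))    # Usuário
print(unmangle("Ligaã§ãµes"))  # Ligações
print(unmangle("Nãºmeros"))    # Números
```

This round trip only works when every garbled character maps back to a defined Code Page 1252 byte; if a byte fell on one of the five code points that are undefined in Code Page 1252, that string cannot be recovered this way.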