SQL Server – Not Displaying Special Characters Imported from a File

Tags: bulk-insert, encoding, import, sql-server

The scenario: a SQL Server instance with a database that is fed mainly through BULK INSERT operations, and some of the inserted text contains special characters such as ñ, because I am working in a Spanish environment.

So, after the first few tests I realize these special characters are not being displayed correctly when I run a simple SELECT, and I start checking everything I can think of:

  • File encoding: the file to be bulk-inserted has the right encoding: ANSI
  • Database encoding: the database has the right encoding (thank god): select collation_name from sys.databases where name='DBNAME'; results in SQL_Latin1_General_CP1_CI_AS
    • Latin1: charset used. This suits me fine enough
    • General: nothing really interesting here
    • CP1: the code page; CP1 is shorthand for code page 1252, i.e. the Windows-1252 encoding, which is fairly similar to Latin1
    • CI: case insensitive
    • AS: Accent sensitive, so á is different from a
  • Export the data to a file and check: the file encoding is ANSI, but the data is not displayed correctly; instead of the special characters I find other characters that make the text hard to read.

With these tests, I conclude that the data is not being stored correctly, which is why it is not displayed or exported correctly. Almost every solution I have found on the internet suggests using nvarchar instead of varchar for string columns, but that does not solve this scenario. What is sabotaging my inserts?
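For reference, the kind of substitution I am seeing can be illustrated outside SQL Server. This is only a sketch in Python (the code page names are my guesses for a Spanish/Western setup), showing how one sequence of bytes reads differently depending on the code page used to interpret it:

```python
# A file saved as ANSI (Windows-1252) stores ñ as the single byte 0xF1.
raw = "España".encode("cp1252")

# Read back with the same code page: the text is intact.
print(raw.decode("cp1252"))   # España

# Read back with a different code page (e.g. the OEM code page 850):
# the same bytes turn into hard-to-read characters.
print(raw.decode("cp850"))    # Espa±a
```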

Best Answer

If characters are being imported incorrectly, then there is a problem with one (or both) of the following areas, as this is really a two-step process:

  1. BCP / BULK INSERT does not know how the file is encoded and is interpreting it incorrectly

  2. The destination column is VARCHAR (or CHAR {or TEXT, but don't use TEXT} ) and the Collation of that destination column (not database) uses a code page that does not have a mapping for the incorrectly imported character(s).

It is important to note here that the default Collation of the database where the import table resides is not relevant. The only Collation that matters is the Collation of each particular string column being imported into. The Collation of each string column can be different, and none of them need to match the database's default Collation. It is just typical for most columns' Collations to match the database's default, since that is the Collation used when new tables and columns are created without specifying one via the COLLATE keyword.

Step 1: File Encoding

BCP / BULK INSERT (or most any other code reading a file) will not know what encoding a file is using unless the file uses one of the few encodings that have a Byte Order Mark (BOM). Extended ASCII encodings don't use byte order marks, however, and thus cannot be detected programmatically (at least not with certainty). When using an Extended ASCII encoding, you either need to specify the code page or else a default will be assumed.
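The BOM point is easy to check; a minimal Python sketch (the byte values shown are the standard BOMs):

```python
# UTF-8 and UTF-16 files can announce their encoding with a Byte Order Mark;
# Extended ASCII encodings like Windows-1252 have no such marker.
bom_utf8 = "ñ".encode("utf-8-sig")   # starts with BOM EF BB BF
bom_utf16 = "ñ".encode("utf-16")     # starts with BOM FF FE (little-endian)
ansi = "ñ".encode("cp1252")          # just the byte F1 -- nothing to detect

print(bom_utf8.hex(), bom_utf16.hex(), ansi.hex())
```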

As per the MSDN page for the bcp Utility, under the -C option (BULK INSERT exposes the same setting through its CODEPAGE option):

OEM -> Default code page used by the client. This is the default code page used if -C is not specified.

You can determine the default code page by opening a command prompt and running either mode or chcp (I prefer chcp as it also allows for changing the code page if you pass in a value).

If your default code page is 850 and yet the file was saved with an encoding of Windows-1252 Latin1 (ANSI), then there could easily be problems interpreting the file since the character mappings are not the same between those two code pages. This has nothing to do with SQL Server in any way.

The ñ character has a value of 241 on code page 1252. BUT, on code page 850 that same character has a value of 164. The file, regardless of anything else, is a series of bytes, and one of those bytes has a decimal value of 241 (because when it was saved, it was told to use code page 1252 which determined that ñ needed to be stored as 241). Now, when BCP reads the file, if it is using the default MS-DOS code page of 850, that same byte value of 241 maps to character ±. If you were to specify a code page of either ACP or 1252 (same thing) via the -C switch, then BCP will know that a byte of value 241 is actually ñ. Or, you could specify a code page of 1255 (Windows Hebrew), then BCP / BULK INSERT will interpret that same byte value of 241 as being character ס.
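These mappings can be verified outside of BCP; as a sketch in Python (standing in for the decoding step that BCP / BULK INSERT performs):

```python
# One and the same byte, interpreted under three different code pages.
b = bytes([241])

print(b.decode("cp1252"))  # ñ  (Windows-1252, the ANSI code page on Western systems)
print(b.decode("cp850"))   # ±  (default OEM / MS-DOS code page)
print(b.decode("cp1255"))  # ס  (Windows Hebrew)

# And going the other way: on code page 850, ñ lives at a different value.
print("ñ".encode("cp850")[0])  # 164
```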

Step 2: Destination Column Datatype and Collation

Once BCP / BULK INSERT (or any client app) reads the data in, it exists as characters per those mappings, not just as the original byte values. Whatever characters were read by BCP / BULK INSERT will be stored in the destination column as those characters, as long as they can be represented in the destination datatype and Collation. A destination datatype of NVARCHAR, NCHAR, or NTEXT (but don't use NTEXT) can hold all characters, so it won't matter what the Collation is. But if the destination datatype is VARCHAR, CHAR, or TEXT, then the Collation determines the code page, which in turn determines the character mappings. If the destination datatype is one of those last 3 and uses a Collation associated with the same code page used for the file, then everything should work just fine. Or, if the Collation is associated with a different code page, then an attempt will be made to map the character, not the byte value.
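This character-level (rather than byte-level) conversion can be sketched as follows, with Python's codecs standing in for SQL Server's collation conversion (an illustration, not SQL Server's actual code path):

```python
# The character ñ, as read by BCP under code page 1252.

# Destination collation uses code page 850: the *character* is mapped,
# so it survives, just stored under a different byte value (164).
print("ñ".encode("cp850"))                      # b'\xa4' -> still displays as ñ

# Destination collation uses code page 1255: no mapping exists for ñ,
# so the fallback character '?' is substituted instead.
print("ñ".encode("cp1255", errors="replace"))   # b'?'
```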

Meaning, if BCP / BULK INSERT is working with code page 1252 and the character ñ (value 241 on code page 1252), and you import that into a VARCHAR column with a Collation of SQL_Latin1_General_CP850_CI_AS -- which uses code page 850 -- then you will see ñ (the same character, now stored as value 164 on code page 850) rather than ± (the character that occupies value 241 on code page 850). But if you import into a VARCHAR column with a Collation of Hebrew_CI_AS -- which uses code page 1255 -- then you will see ? instead of ס (value 241 on code page 1255), because code page 1255 has no mapping for ñ.
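Both steps can be simulated end to end. In the sketch below (the function name and Python itself are mine; this merely models the process, it is not SQL Server's code), decoding stands in for BCP / BULK INSERT reading the file, and re-encoding stands in for storage into a VARCHAR column:

```python
def simulate_import(file_bytes, bcp_codepage, column_codepage):
    # Step 1: BCP / BULK INSERT decodes the file bytes using its code page.
    text = file_bytes.decode(bcp_codepage)
    # Step 2: the VARCHAR column stores the *characters* under the
    # collation's code page; unmappable characters become '?'.
    return text.encode(column_codepage, errors="replace")

file_bytes = "ñ".encode("cp1252")                       # the file holds byte 241

print(simulate_import(file_bytes, "cp1252", "cp850"))   # b'\xa4' -> ñ, imported correctly
print(simulate_import(file_bytes, "cp850",  "cp1252"))  # b'\xb1' -> ±, the garbled import
print(simulate_import(file_bytes, "cp1252", "cp1255"))  # b'?'   -> no mapping for ñ
```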