For some reason the MS SQL Server 2016 bulk insert misinterprets/translates Unicode characters:
- C9 (É) into 2B (+)
- A1 (¡) into ED (í)
- A0 ( ) into E1 (á)
- AE (®) into AB («)
- CB (Ë) into 2D (-)
- D1 (Ñ) into 2D (-)
- 92 (’) into C6 (Æ)
- 96 (–) into FB (û)
That is, Notepad++ and xxd show the flat file contains 0xC9, but after the bulk insert the table shows "+", and casting the column to varbinary in SQL Server shows 0x2B. The backup also has 0xC9.
I'm bulk inserting 25 flat files into MS SQL Server 2016. It's 15 GB of data, and I'm using a pipe (|) field delimiter and a CRLF row delimiter.
I bulk insert into the truncated table structures from a backup I am provided, and when I compare the result to the backup there are differences. NOTE: I have to wait 25 hours for the backup from the data source, but can get the flat files in 15 minutes.
Some differences are acceptable (from find-and-replace operations I apply to the flat files), but many are due to characters being misinterpreted.
The structure of one example table is:
CREATE TABLE [dbo].[obfuscated_name](
[ob_1] [int] NOT NULL,
[ob_2] [int] NOT NULL,
[ob_3] [int] NOT NULL,
[ob_4] [nvarchar](300) NULL
) ON [PRIMARY]
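The load itself is along these lines (a sketch only; the file path and options are reconstructed from the description above, not the actual command):

```sql
-- Hypothetical reconstruction of one of the 25 loads; note that no
-- code page is specified.
BULK INSERT [dbo].[obfuscated_name]
FROM 'C:\import\obfuscated_name.txt'  -- placeholder path
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n'  -- commonly used for CRLF-terminated files
);
```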
The database collation is default: SQL_Latin1_General_CP1_CI_AS. No columns have a different collation. This collation should use code page 1252, which should properly interpret the chars I'm having trouble with.
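That collation-to-code-page mapping can be confirmed directly; `COLLATIONPROPERTY` reports the code page behind a collation:

```sql
SELECT COLLATIONPROPERTY(N'SQL_Latin1_General_CP1_CI_AS', N'CodePage') AS [CodePage];
-- returns 1252
```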
My process runs against production data that is constantly in flux, so I'm worried other changes may pop up, and I'd like to know the source of the problem rather than isolate and manually patch each misinterpretation.
Best Answer
This is not a bug in SQL Server (or even in Windows), nor is it a situation that requires the additional step of converting the file into another encoding (i.e. into "Unicode", which in Windows-world means "UTF-16 Little Endian"). It is just a simple miscommunication.
The source of the communication breakdown (it's always the same, right ;-) is merely not agreeing on the nature of the source data. When moving character data from one place to another, it is important to specify the encoding on both sides. Yes, the `SQL_Latin1_General_CP1_*` collations use Code Page 1252. However, if you don't tell `BULK INSERT` or BCP.exe what the code page of the source file is, then they will assume the system default: the documentation for `BULK INSERT` (for the `CODEPAGE =` argument) and for BCP.exe (for the `-C` switch) both state that the default is the OEM code page. And the default OEM code page for Windows is (at least for US English systems) 437. You can see this by executing `chcp` in a Command Prompt, which returns: `Active code page: 437`.
But your source file wasn't encoded using Code Page 437. It was encoded using Code Page 1252.
So here is what is happening:

1. `BULK INSERT` / BCP reads byte 0xC9 from the file and, assuming Code Page 437, interprets it as `╔` rather than the `É` it would be when using Code Page 1252 (it really is read as `╔`, but BULK INSERT / BCP isn't displaying it, so you won't see this).
2. That character then has to be converted to the destination's Code Page 1252. `╔` doesn't exist on Code Page 1252, so a "best fit" mapping is applied, which produces `+` (byte 0x2B).
3. Likewise, byte 0xAE (`®` on Code Page 1252) is read as the Code Page 437 character `«`, which maps to byte 0xAB on Code Page 1252 (because it also displays as `«`).

The same conversion happens for all of the characters noted in the question.
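A sketch of that conversion in T-SQL (a hypothetical query, assuming a database whose default collation uses Code Page 1252, such as SQL_Latin1_General_CP1_CI_AS): each row pairs a source byte with the character that Code Page 437 says it is, then converts that character to `VARCHAR` to show the byte that actually gets stored:

```sql
-- [ReadAsCP437] is the character that [SourceByte] represents on
-- Code Page 437; [StoredByte] is that character converted to the
-- database's Code Page 1252.
SELECT t.[SourceByte],
       t.[ReadAsCP437],
       CONVERT(VARBINARY(2), CONVERT(VARCHAR(2), t.[ReadAsCP437])) AS [StoredByte]
FROM (VALUES (0xC9, N'╔'), (0xA1, N'í'), (0xA0, N'á'), (0xAE, N'«'),
             (0xCB, N'╦'), (0xD1, N'╤'), (0x92, N'Æ'), (0x96, N'û')
     ) t([SourceByte], [ReadAsCP437]);
```

The `[StoredByte]` column should come back as 0x2B, 0xED, 0xE1, 0xAB, 0x2D, 0x2D, 0xC6, and 0xFB, matching the misinterpreted bytes listed in the question.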
The Code Page 437 characters for bytes 0xC9, 0xCB, and 0xD1 do not exist in Code Page 1252, so the "best fit" mappings are used, which is why you end up with the `+` and `-` characters after the conversion. Also, even if the destination column is `NVARCHAR`, all of these mappings are the same, so you would see the exact same behavior.

So, your choices are:
1. If using the T-SQL `BULK INSERT` command, specify the `WITH CODEPAGE =` option with one of the following values:
   - `'ACP'` (this is the same as `'1252'`)
   - `'RAW'` (this uses the code page of the column's collation if inserting into `VARCHAR`, or is the same as `'OEM'` / Code Page 437 when inserting into `NVARCHAR`)
   - `'1252'` (this is the same as `'ACP'`)
2. Or, if using BCP.exe, indicate that the incoming file uses Code Page 1252 via the `-C` command-line switch along with one of the following values (see the notes in option #1): `ACP`, `RAW`, or `1252`.
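For example, a sketch of the corrected load (same placeholder names as above):

```sql
-- Same load as before, but now declaring the file's actual encoding:
BULK INSERT [dbo].[obfuscated_name]
FROM 'C:\import\obfuscated_name.txt'  -- placeholder path
WITH (
    CODEPAGE        = '1252',  -- or 'ACP'
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n'
);
```

With BCP.exe, the equivalent is adding `-C 1252` to the command line.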
Please note that:

- When testing `BULK INSERT`, inserting into a `VARCHAR` column using the set of characters noted in the question, the `ACP` (which I believe stands for ANSI Code Page), `RAW`, and `1252` values all produced the correct results.
- Leaving out `WITH CODEPAGE =` produced the same results that the O.P. reported in the question. This was the same as specifying `WITH CODEPAGE = 'OEM'`.
- When inserting into an `NVARCHAR` column, both `ACP` and `1252` worked as desired, but `RAW` produced the same result as `OEM` (i.e. it used Code Page 437 instead of the Code Page 1252 specified by the collation of the column).
- The `-C` switch did not use the code page of the process; that is, using CHCP to change the Command Prompt's code page had no effect: Code Page 437 was still used as the source code page.

P.S. Since the data here is all 8-bit encoded, there are no "Unicode characters": no Unicode is being used.