SQL Server Bulk Insert properly interprets some Unicode characters but not others

bulk-insert | collation | encoding | sql-server | unicode

For some reason the MS SQL Server 2016 bulk insert misinterprets/translates Unicode characters:

  • C9 (É) into 2B (+)
  • A1 (¡) into ED (í)
  • A0 ( ) into E1 (á)
  • AE (®) into AB («)
  • CB (Ë) into 2D (-)
  • D1 (Ñ) into 2D (-)
  • 92 (’) into C6 (Æ)
  • 96 (–) into FB (û)

That is, Notepad++ and xxd show the flat file contains 0xC9, but after the bulk insert the table shows "+", and casting to varbinary in SQL Server shows it as 0x2B. The backup also contains 0xC9.
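
For reference, this is the kind of byte-level check I'm doing in SQL Server (a minimal sketch; the WHERE filter is just an illustration, and the table is the example shown further down):

SELECT TOP (10)
       ob_4,
       CONVERT(VARBINARY(600), ob_4) AS [StoredBytes],   -- NVARCHAR, so these are UTF-16LE bytes
       CONVERT(VARBINARY(300), CONVERT(VARCHAR(300), ob_4)) AS [Cp1252Bytes]  -- bytes after forcing the column collation's code page
FROM   dbo.obfuscated_name
WHERE  ob_4 LIKE '%+%';   -- rows where an expected É shows up as +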

I'm bulk inserting 25 flat files into MS SQL Server 2016. It's 15 GB of data, and I'm using a pipe (|) field delimiter and a CRLF row delimiter.

I bulk insert into the truncated structure of a backup I am provided, and when I compare the result to the backup there are differences. NOTE: I have to wait 25 hours to get the backup from the data source, but I can get the flat files in 15 minutes.

Some differences are expected (from find-and-replace operations I'm applying to the flat files), but many are due to Unicode characters being misinterpreted.

The structure of one example table is:

CREATE TABLE [dbo].[obfuscated_name](
    [ob_1] [int] NOT NULL,
    [ob_2] [int] NOT NULL,
    [ob_3] [int] NOT NULL,
    [ob_4] [nvarchar](300) NULL
) ON [PRIMARY]

The database collation is the default, SQL_Latin1_General_CP1_CI_AS, and no columns have a different collation. This collation uses code page 1252, which should properly interpret the characters I'm having trouble with.
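
As a sanity check, the code page behind the collation can be confirmed directly (a minimal sketch using the built-in metadata functions):

SELECT CONVERT(SYSNAME, DATABASEPROPERTYEX(DB_NAME(), 'Collation')) AS [DatabaseCollation],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS [CodePage];
-- [CodePage] returns 1252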

My process is running against production data that is constantly in flux, so I'm worried other changes may pop up, and I'd like to know the source of the issue rather than try to isolate the issue and manually update the misinterpretations.

Best Answer

This is not a bug in SQL Server (or even in Windows), nor is it a situation that requires the additional step of converting the file into another encoding (i.e. into "Unicode", which in Windows-world means "UTF-16 Little Endian"). It is just a simple miscommunication.

The source of the communication breakdown (it's always the same, right? ;-) is merely not agreeing on the nature of the source data. When moving character data from one place to another, it is important to specify the encoding on both sides. Yes, the SQL_Latin1_General_CP1_* Collations use Code Page 1252. However, if you don't tell BULK INSERT or BCP.exe what the Code Page of the source file is, then they will assume that the Code Page is the system default.

The documentation for BULK INSERT even states (for the CODEPAGE = argument):

'OEM' (default) = Columns of char, varchar, or text data type are converted from the system OEM code page to the SQL Server code page.

The documentation for BCP.exe states (for the -C switch):

OEM = Default code page used by the client. This is the default code page used if -C is not specified.

And the default OEM Code Page for Windows (at least on US English systems) is 437. You can see this if you execute the following in a Command Prompt:

C:\> CHCP

It will return:

Active code page: 437

But your source file wasn't encoded using Code Page 437. It was encoded using Code Page 1252.

So here is what is happening:

  1. Bytes are bytes. And bytes representing character data can only be interpreted via an encoding. Anything reading a file doesn't read characters from the file; it reads the bytes of the file and displays the characters based on the encoding that is specified.
  2. BULK INSERT / BCP reads in byte 0xC9. 0xC9 displays as É when using Code Page 1252.
  3. BULK INSERT / BCP is not given a source Code Page, so it checks the current Code Page for the process and is told: 437.
  4. BULK INSERT / BCP now has a byte of 0xC9 for Code Page 437 (which displays as ╔, but BULK INSERT / BCP isn't displaying it, so you won't see this).
  5. BULK INSERT / BCP inserts this data into a column using a Collation that specifies Code Page 1252.
  6. SQL Server sees that the incoming data is using a different Code Page than the destination uses and so has to convert the incoming data such that the characters appear to be the same (as much as possible), even if the underlying values change.
  7. The Code Page 437 to Code Page 1252 mapping indicates that byte 0xC9 maps to byte 0x2B. Similarly, byte 0xAE (which is ® on Code Page 1252) on Code Page 437 (which displays as «) maps to byte 0xAB on Code Page 1252 (because it also displays as «).

The following example shows this conversion for all of the characters noted in the question:

DECLARE @CodePageConversion TABLE
(
   -- The byte as it reads in Code Page 1252 (i.e. what the source file actually contains):
   [ActualSource_CP1252] AS CONVERT(VARCHAR(10), CONVERT(BINARY(1),
                    [PerceivedSource_CP437])) COLLATE SQL_Latin1_General_CP1_CI_AS,

   -- The byte as BULK INSERT / BCP perceives it, i.e. as Code Page 437:
   [PerceivedSource_CP437] VARCHAR(10) COLLATE SQL_Latin1_General_CP437_CI_AS,

   -- The raw source byte:
   [Source_Value] AS (CONVERT(BINARY(1), [PerceivedSource_CP437])),

   -- The character after conversion into the destination's Code Page 1252:
   [Destination_CP1252] AS (CONVERT(VARCHAR(10), [PerceivedSource_CP437]
                  COLLATE SQL_Latin1_General_CP1_CI_AS)),

   -- The byte value that actually gets stored in the destination:
   [CP1252_Value] AS (CONVERT(BINARY(1), CONVERT(VARCHAR(10),
                  [PerceivedSource_CP437] COLLATE SQL_Latin1_General_CP1_CI_AS)))
);

INSERT INTO @CodePageConversion
VALUES      (0xC9), (0xA1), (0xA0), (0xAE), (0xCB), (0xD1), (0x92), (0x96);

SELECT * FROM @CodePageConversion;

which returns:

ActualSource_CP1252  PerceivedSource_CP437  Source_Value  Destination_CP1252  CP1252_Value
É                    ╔                      0xC9          +                   0x2B
¡                    í                      0xA1          í                   0xED
                     á                      0xA0          á                   0xE1
®                    «                      0xAE          «                   0xAB
Ë                    ╦                      0xCB          -                   0x2D
Ñ                    ╤                      0xD1          -                   0x2D
’                    Æ                      0x92          Æ                   0xC6
–                    û                      0x96          û                   0xFB

The characters for 0xC9, 0xCB, and 0xD1 do not exist in Code Page 1252, so the "best fit" mappings are used, which is why you end up with the + and - characters after the conversion.
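
You can see the best-fit mapping in isolation by converting the Code Page 437 character from its Unicode form into a VARCHAR that uses Code Page 1252 (a quick sketch; U+2554 is ╔, the character that byte 0xC9 represents on Code Page 437):

SELECT CONVERT(VARCHAR(10),
               NCHAR(0x2554) COLLATE SQL_Latin1_General_CP1_CI_AS) AS [BestFit];
-- Returns: +   (╔ has no equivalent in Code Page 1252, so the best-fit mapping substitutes +)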

Also, even if the destination column is using NVARCHAR, all of these mappings are the same, so you would be seeing the exact same behavior.

So, your choices are (a sketch of each follows this list):

  1. If using the T-SQL BULK INSERT command, specify the WITH CODEPAGE = option with one of the following values:

    1. 'ACP' (this is the same as '1252')
    2. 'RAW' (this uses the Code Page of the column's Collation if inserting into VARCHAR, or is the same as 'OEM' / Code Page 437 when inserting into NVARCHAR)
    3. '1252' (this is the same as 'ACP')
  2. Or, if using BCP.exe, indicate that the incoming file uses Code Page 1252 via the -C command-line switch along with one of the following values (see notes in option #1):

    1. ACP
    2. RAW
    3. 1252
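
For concreteness, here is a minimal sketch of both options, using the pipe delimiter and CRLF row terminator described in the question; the file path, server name, and database name are hypothetical:

BULK INSERT [dbo].[obfuscated_name]
FROM 'C:\Data\obfuscated_name.txt'   -- hypothetical path
WITH (
    CODEPAGE        = '1252',    -- the file is Windows-1252, not OEM Code Page 437
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '0x0D0A'   -- CRLF, in the documented hex notation
);

Or, with BCP.exe (again, the server and database names are placeholders):

bcp dbo.obfuscated_name in "C:\Data\obfuscated_name.txt" -S YourServer -d YourDatabase -T -c -C 1252 -t "|" -r "\r\n"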

Please note that:

  1. I tested with BULK INSERT, inserting into a VARCHAR column, using the set of characters noted in the question, and the ACP (which I believe stands for ANSI Code Page), RAW, and 1252 values all produced the correct results.
  2. Not specifying WITH CODEPAGE = produced the same results that the O.P. reported in the question. This was the same as specifying WITH CODEPAGE = 'OEM'.
  3. When inserting into an NVARCHAR column, both ACP and 1252 worked as desired, but RAW produced the same result as OEM (i.e. using Code Page 437 instead of Code Page 1252 which is specified by the Collation of the column).
  4. I tested using BCP.exe, and not specifying the -C switch did not use the Code Page of the process. Meaning, using CHCP to change the command prompt's Code Page had no effect: Code Page 437 was still used as the source Code Page.

P.S. Since the data here is all 8-bit encoded, there are no "Unicode characters" involved, because no Unicode is being used anywhere in this process.