Sql-server – Importing Chinese characters via SSIS

collationimportsql serverssisunicode

I'm trying to import file that contains Chinese characters into a table. File encoding is Big5 (Traditional).

Here is a sample file like one that I need to import: https://www.pastiebin.com/5d7782d9b63fa

Table where file has to be imported into has such a structure:

create table dbo.Test (
      AccountId   numeric(18, 0) not null
    , Province    nvarchar(50)       null
    , City        nvarchar(50)       collate Chinese_Hong_Kong_Stroke_90_CI_AI
    , Country     nvarchar(50)       null
    , Gender      nvarchar(50)       null
)

If I import using OPENROWSET/BULK then all the data is transferred correctly:

select AccountId, Province, City, Country, Gender
from openrowset (
      bulk 'C:\chinese_sample.dat'
    , firstrow = 1
    , formatfile = 'C:\chinese_sample.xml'
) t

Here is a format file that I use: https://www.pastiebin.com/5d7783396f9e4

enter image description here


But if I try to import file using SSIS, then Chinese characters are not parsed correctly.

In Flat File Source I use DataType string [DT_STR] and CodePage 950. Then convert it to Unicode string [DT_WSTR].

enter image description here

Here is how text is imported into table:

enter image description here

As we can see some characters are parsed correctly and some are not.
What am I missing?

Best Answer

The problem appears to be that the lines that are not importing correctly have characters that, encoded in code page 950, having a trailing byte of 0x7C, which is valid, but also happens to be the pipe symbol, which you are using as a delimiter. For example:

SELECT CONVERT(VARBINARY(20), CONVERT(VARCHAR(20), N'四會'
                                      COLLATE Chinese_Hong_Kong_Stroke_90_CI_AI));
-- 0xA57C B77C

Each of those two characters has a trail-byte of 0x7C. The parser seems to be viewing that as a delimiter instead of part of a 2-byte sequence for a code page 950 character. This is why you get a "?" in both "City" and "Country" with the remaining input line in the "Gender" column. The "?" in those two columns is due to the lead bytes 0xA5 and 0xB7 not being valid on their own.

The next line has the same issue:

SELECT CONVERT(VARBINARY(20), CONVERT(VARCHAR(20), N'元朗大坑|'
                              COLLATE Chinese_Hong_Kong_Stroke_90_CI_AI))
-- 0xA4B8 AED4 A46A A77C 7C

This time I kept the delimiter that comes at the end of the "City" value (so it would be clearer that the delimiter is the same byte value as the trail byte of that 4th character). The 4th character, , is encoded as 0xA77C in code page 950. This is why only the 4th character shows up as "?".

So, this might be a bug in SSIS. Or, perhaps it's a configuration issue. Is there a way to indicate that the entire file, not just this one column, is code page 950? Dealing with code pages on a column-by-column basis, in terms of a text file, makes no sense. The entire file is encoded as code page 950, not just that one column. If it's possible to change the delimiter that might help, but it most likely just delays the problem as any delimiter can be encoded as a valid trail-byte value for double-byte characters. Given that OPENROWSET works correctly, I have to believe that this also can (although you aren't setting the file's encoding for OPENROWSET, just the column, so this could still be an SSIS bug).

Please try the following to set the file's encoding within SSIS:

  1. Go to the Flat File Connection Manager Editor (General Page)
  2. Under "Code Page", enter in 950.
  3. Make sure that "Unicode" is not selected.

Also, check the "CITY" column under "Output Columns" and make sure that it is using "Standard Parse" and not "Fast Parse" (since "Fast Parse" is locale-insensitive).