Sql-server – Importing Chinese characters via SSIS

collationimportsql serverssisunicode

I'm trying to import file that contains Chinese characters into a table. File encoding is Big5 (Traditional).

Here is a sample file like one that I need to import: https://www.pastiebin.com/5d7782d9b63fa

Table where file has to be imported into has such a structure:

create table dbo.Test (
      AccountId   numeric(18, 0) not null
    , Province    nvarchar(50)       null
    , City        nvarchar(50)       collate Chinese_Hong_Kong_Stroke_90_CI_AI
    , Country     nvarchar(50)       null
    , Gender      nvarchar(50)       null
)

If I import using OPENROWSET/BULK then all the data is transferred correctly:

select AccountId, Province, City, Country, Gender
from openrowset (
      bulk 'C:\chinese_sample.dat'
    , firstrow = 1
    , formatfile = 'C:\chinese_sample.xml'
) t

Here is a format file that I use: https://www.pastiebin.com/5d7783396f9e4

But if I try to import file using SSIS, then Chinese characters are not parsed correctly.

In Flat File Source I use DataType string [DT_STR] and CodePage 950. Then convert it to Unicode string [DT_WSTR].

Here is how text is imported into table:

As we can see some characters are parsed correctly and some are not.
What am I missing?

Best Answer

The problem appears to be that the lines that are not importing correctly have characters that, encoded in code page 950, having a trailing byte of 0x7C, which is valid, but also happens to be the pipe symbol, which you are using as a delimiter. For example:

SELECT CONVERT(VARBINARY(20), CONVERT(VARCHAR(20), N'四會'
                                      COLLATE Chinese_Hong_Kong_Stroke_90_CI_AI));
-- 0xA57C B77C

Each of those two characters has a trail-byte of 0x7C. The parser seems to be viewing that as a delimiter instead of part of a 2-byte sequence for a code page 950 character. This is why you get a "?" in both "City" and "Country" with the remaining input line in the "Gender" column. The "?" in those two columns is due to the lead bytes 0xA5 and 0xB7 not being valid on their own.

The next line has the same issue:

SELECT CONVERT(VARBINARY(20), CONVERT(VARCHAR(20), N'元朗大坑|'
                              COLLATE Chinese_Hong_Kong_Stroke_90_CI_AI))
-- 0xA4B8 AED4 A46A A77C 7C

This time I kept the delimiter that comes at the end of the "City" value (so it would be clearer that the delimiter is the same byte value as the trail byte of that 4th character). The 4th character, 坑, is encoded as 0xA77C in code page 950. This is why only the 4th character shows up as "?".

So, this might be a bug in SSIS. Or, perhaps it's a configuration issue. Is there a way to indicate that the entire file, not just this one column, is code page 950? Dealing with code pages on a column-by-column basis, in terms of a text file, makes no sense. The entire file is encoded as code page 950, not just that one column. If it's possible to change the delimiter that might help, but it most likely just delays the problem as any delimiter can be encoded as a valid trail-byte value for double-byte characters. Given that OPENROWSET works correctly, I have to believe that this also can (although you aren't setting the file's encoding for OPENROWSET, just the column, so this could still be an SSIS bug).

Please try the following to set the file's encoding within SSIS:

Go to the Flat File Connection Manager Editor (General Page)
Under "Code Page", enter in 950.
Make sure that "Unicode" is not selected.

Also, check the "CITY" column under "Output Columns" and make sure that it is using "Standard Parse" and not "Fast Parse" (since "Fast Parse" is locale-insensitive).

Related Solutions

SSIS – Excel destination does not allow characters more than 255

SSIS Excel Data Source:

Error with output column “Comments” (5691) on output “Excel Source Output” (5596).

The column status returned was:

“Text was truncated or one or more characters had no match in the target code page.”

Reason for this error:

The error is due to the lengthy data in the excel.
SSIS has an inbuilt logic to scan the data in the spread sheet. It will scan the first 8 rows and based on that it will written a logic to build the table's logic for the package. If your lengthy data in not in the 8 row then your ssis wont respond it.

Considering am having a column named "Name" First 8 rows is of length <255 9th row is of length > 255. While executing the package you will get the above error, its because your input value will be truncated and SSIS wont allow for that. In this case, Two types of logics can be followed,

Option1 : Its tricky one, just copy the 9th row to the top (may be as 1 row).Now try to create your package using import/export wizard. Your ssis will create column with width nvarchar(Max) which will accept upto 2 GB.

Option2: Change the Preparation SQL task query, change the data type as varchar(Max) and modify the excel source ->advanced editor and change the output columns type and length and external column type and length.

Please read the blog

Sql-server – Storing Japanese characters in a table

You need to use NCHAR(1 - 4000) or NVARCHAR, either as NVARCHAR(1 - 4000) or NVARCHAR(MAX) for storing anywhere from 4001 to just over 1,073,741,822 characters (or possibly less if storing any supplementary characters as described below).

Technically, you can store Japanese characters in VARCHAR fields if you use a Japanese_* Collation that is associated with Code Page 932. However, that is considered a "legacy" approach and would still leave you with some issues. The appropriate way to handle this is to use a Unicode datatype as mentioned above. Please see the UPDATE section at the end for details about VARCHAR.

You will also want to specify a Japanese collation so that the data compares and sorts as expected. You can find the available Japanese collations using:

SELECT * FROM fn_helpcollations() WHERE name LIKE N'Japanese%';

And you use that value in the field specification like:

CREATE TABLE dbo.test
(
  JapaneseText NVARCHAR(3000) COLLATE Japanese_CI_AS_KS_WS
);

Please see the following section of MSDN pages for more info on using collation and what each of the CI/CS, AS/AI, KS, and WS mean, as well as BIN/BIN2 and SC (not shown above): Collation.

And depending on which characters you need to store, you might need to pay close attention to the collations ending with SC (i.e. "Supplementary Characters"). By default, NCHAR / NVARCHAR data is stored as UCS-2, which is very similar to UTF-16, but UCS-2 is always 2-bytes per character. On the other hand, UTF-16, in order to support more than 65,536 characters (max size of 2 bytes, or UInt16.MaxValue + 1) can store characters that are 4 bytes (known as "surrogate pairs"). Please see the following MSDN page on Collation and Unicode Support ("Supplementary Characters" section) for more details.

Absolutely do not use NTEXT. That has been deprecated since SQL Server 2005 came out! There is no benefit / reason for using it and, in fact, there are several drawbacks.

UPDATE

While not ideal, it is possible to store Japanese characters in CHAR / VARCHAR fields and variables. Doing so requires that the Database's default Collation be set to one that is associated with Code Page 932 (Shift-JIS). You can find that list of Collations by running the following query:

SELECT col.name
FROM   sys.fn_helpcollations() col
WHERE  COLLATIONPROPERTY(col.name, 'CodePage') = 932;

I did a simple test by creating a database with an entry from that list and ran the following statements:

SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation'); -- Japanese_Unicode_CI_AS
SELECT COLLATIONPROPERTY(N'Japanese_Unicode_CI_AS', 'CodePage'); -- 932
SELECT CONVERT(VARCHAR(50), 0x944094B294CD985198EE9AD79AA0); -- 如抜範浪偃壅國
SELECT LEN('如抜範浪偃壅國'), DATALENGTH('如抜範浪偃壅國'); -- 7, 14

This works because Code Page 932 is a Double-Byte Character Set (DBCS), which is different than UCS-2 / UTF-16 which is also double-byte. A DBCS character set is one that is double-byte within an 8-bit encoding (like the Extended ASCII Code Pages). You can see in that last query that the DATALENGTH is twice the the character LENgth and that the data is in a VARCHAR type since there is no N prefix on the string literals and the CONVERT was to VARCHAR, not NVARCHAR. There are 4 DBCS Code Pages supported in Windows / SQL Server:

932 = Japanese (Shift-JIS)
936 = Chinese Simplified (GB2312)
949 = Korean
950 = Chinese Traditional (Big5)

Only use these if you absolutely need to, such as supporting interaction with a legacy system. Of course, the Collations are still fine to use, but store the data in NVARCHAR instead of VARCHAR.

Best Answer

Related Solutions

SSIS – Excel destination does not allow characters more than 255

Sql-server – Storing Japanese characters in a table

Related Question