Sql-server – Japanese characters in VARCHAR column using Japanese_Unicode_CI_AS

Tags: collation, csv, import, sql-server, varchar

I am loading a table from a CSV file, and the Japanese names in it are being changed into question marks in the SQL table. The column storing these values is a varchar column. I understand that varchar is not Unicode and that this is why some of the characters are being changed to ??.

However, how come the existing values written in Japanese are stored correctly in this varchar column, when ideally they should be in an nvarchar column?

Is there a way to convert nvarchar to varchar?

The database is out of our control and we cannot change the schema.

Best Answer

As Tibor mentioned, the Japanese_Unicode_CI_AS collation (all Japanese_* collations, actually) can store Japanese characters in VARCHAR columns because Windows code page 932 is a double-byte character set (DBCS). I believe there are just under 7,800 Japanese characters mapped to the Windows-932 code page; Unicode, however, contains more Japanese characters than that.
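As a quick illustration of the DBCS point, here is a sketch using Python's built-in cp932 codec, which implements the same mapping as Windows code page 932 (SQL Server performs the equivalent conversion internally when storing VARCHAR data under a Japanese_* collation):

```python
# In code page 932, ASCII characters occupy one byte while Japanese
# characters occupy two -- hence "double-byte character set" (DBCS).
print(len("A".encode("cp932")))   # 1
print(len("ヤ".encode("cp932")))  # 2
```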

It would be very helpful (maybe even necessary) to know some things in order to narrow down the problem:

  1. Exactly how you are importing the CSV file. Are you using BCP or BULK INSERT / OPENROWSET(BULK...)?
  2. An example of at least one Japanese name that is getting converted to ??
  3. You say both that the Japanese names "are changing into question marks" and that "it is changing some of the characters to ??", so:
    1. Are all Japanese characters changing into question marks, or only some?
    2. Are the Japanese characters changing into one question mark ( ? ), two ( ?? ), or some of each?

Without knowing the answers to those questions, I can say that there are two main possibilities:

  1. (all Japanese characters imported as ?):   You aren't telling the import tool what the encoding of the CSV file is. Is it encoded as Windows-932 (or possibly Windows-31J)? Or is it a Unicode encoding such as UTF-8 or UTF-16 (which might be listed as "UCS-2" or "Unicode", depending on the tool)? If you are using BCP, you would need to use the command-line options -c -C 932 for Windows-932, or -c -C 65001 for UTF-8. This problem should be fixable simply by setting the tool to use the correct code page.

  2. (some Japanese characters imported as ? or ??):   Even if you are telling the import tool the correct encoding of the file, there are still Japanese characters that cannot be encoded in the Windows-932 code page. For example:

    -- DROP TABLE ##BCP;
    CREATE TABLE ##BCP ([Value] VARCHAR(50) COLLATE Japanese_Unicode_CI_AS);
    -- NOTE: the third character in the literal below was a supplementary
    -- (non-BMP) character; it may not render correctly here.
    INSERT INTO ##BCP ([Value]) VALUES (N'ヤ :: ㋾ :: ?');
    SELECT * FROM ##BCP;
    

    returns:

    ヤ :: ? :: ??
    

    This problem can only be fixed by doing one of the following:

    1. Change the datatype of the column to be NVARCHAR (though you said that you can't change the schema)
    2. Change the collation of the column to use a UTF-8 collation (i.e. collation names ending in _UTF8; introduced in SQL Server 2019)
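Both failure modes described above can be sketched outside SQL Server using Python's built-in cp932 codec (the same mapping as Windows code page 932). This is an illustration, not BCP or SQL Server itself, and the Japanese name and the supplementary character used below are hypothetical examples. Case 1 (wrong code page assumed) turns every Japanese character into a single ?; case 2 (correct Japanese handling, but characters outside Windows-932) loses only the unmapped characters, with non-BMP characters becoming ?? because each of their two UTF-16 code units is converted separately:

```python
def cp932_lossy(text: str) -> str:
    """Mimic SQL Server storing NVARCHAR (UTF-16) data in a VARCHAR column
    under a Japanese_* collation: every UTF-16 code unit that has no
    mapping in Windows code page 932 is replaced with '?'."""
    data = text.encode("utf-16-le")
    out = bytearray()
    for i in range(0, len(data), 2):  # walk one UTF-16 code unit at a time
        unit = data[i:i + 2].decode("utf-16-le", errors="replace")
        out += unit.encode("cp932", errors="replace")
    return out.decode("cp932")

# Case 1: the import tool assumes a code page with no Japanese characters,
# so every Japanese character collapses to a single '?'.
name = "山田太郎"  # hypothetical Japanese name from the CSV file
print(name.encode("latin-1", errors="replace").decode("latin-1"))  # ????

# Case 2: correct Japanese handling, but some characters fall outside
# Windows-932. '㋾' (BMP, but unmapped) becomes '?'; the hypothetical
# supplementary character U+29E3D becomes '??' (two UTF-16 code units).
print(cp932_lossy("ヤ :: ㋾ :: \U00029E3D"))  # ヤ :: ? :: ??
```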

For info on working with collations / Unicode / encodings, please visit: Collations Info