Sql-server – Storing Japanese characters in a table

collationsql serverunicode

I am using SQL Server 2008 R2 and I want to store Japanese characters in one of the columns of my table.

Say I want to store Japanese names, how can I achieve that? Is there an easy way?

Best Answer

You need to use NCHAR(1 - 4000) or NVARCHAR, either as NVARCHAR(1 - 4000) or NVARCHAR(MAX) for storing anywhere from 4001 to just over 1,073,741,822 characters (or possibly less if storing any supplementary characters as described below).

Technically, you can store Japanese characters in VARCHAR fields if you use a Japanese_* Collation that is associated with Code Page 932. However, that is considered a "legacy" approach and would still leave you with some issues. The appropriate way to handle this is to use a Unicode datatype as mentioned above. Please see the UPDATE section at the end for details about VARCHAR.

You will also want to specify a Japanese collation so that the data compares and sorts as expected. You can find the available Japanese collations using:

SELECT * FROM fn_helpcollations() WHERE name LIKE N'Japanese%';

And you use that value in the field specification like:

CREATE TABLE dbo.test
(
  JapaneseText NVARCHAR(3000) COLLATE Japanese_CI_AS_KS_WS
);

Please see the following section of MSDN pages for more info on using collation and what each of the CI/CS, AS/AI, KS, and WS mean, as well as BIN/BIN2 and SC (not shown above): Collation.

And depending on which characters you need to store, you might need to pay close attention to the collations ending with SC (i.e. "Supplementary Characters"). By default, NCHAR / NVARCHAR data is stored as UCS-2, which is very similar to UTF-16, but UCS-2 is always 2-bytes per character. On the other hand, UTF-16, in order to support more than 65,536 characters (max size of 2 bytes, or UInt16.MaxValue + 1) can store characters that are 4 bytes (known as "surrogate pairs"). Please see the following MSDN page on Collation and Unicode Support ("Supplementary Characters" section) for more details.


Absolutely do not use NTEXT. That has been deprecated since SQL Server 2005 came out! There is no benefit / reason for using it and, in fact, there are several drawbacks.


UPDATE

While not ideal, it is possible to store Japanese characters in CHAR / VARCHAR fields and variables. Doing so requires that the Database's default Collation be set to one that is associated with Code Page 932 (Shift-JIS). You can find that list of Collations by running the following query:

SELECT col.name
FROM   sys.fn_helpcollations() col
WHERE  COLLATIONPROPERTY(col.name, 'CodePage') = 932;

I did a simple test by creating a database with an entry from that list and ran the following statements:

SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation'); -- Japanese_Unicode_CI_AS
SELECT COLLATIONPROPERTY(N'Japanese_Unicode_CI_AS', 'CodePage'); -- 932
SELECT CONVERT(VARCHAR(50), 0x944094B294CD985198EE9AD79AA0); -- 如抜範浪偃壅國
SELECT LEN('如抜範浪偃壅國'), DATALENGTH('如抜範浪偃壅國'); -- 7, 14

This works because Code Page 932 is a Double-Byte Character Set (DBCS), which is different than UCS-2 / UTF-16 which is also double-byte. A DBCS character set is one that is double-byte within an 8-bit encoding (like the Extended ASCII Code Pages). You can see in that last query that the DATALENGTH is twice the the character LENgth and that the data is in a VARCHAR type since there is no N prefix on the string literals and the CONVERT was to VARCHAR, not NVARCHAR. There are 4 DBCS Code Pages supported in Windows / SQL Server:

  • 932 = Japanese (Shift-JIS)
  • 936 = Chinese Simplified (GB2312)
  • 949 = Korean
  • 950 = Chinese Traditional (Big5)

Only use these if you absolutely need to, such as supporting interaction with a legacy system. Of course, the Collations are still fine to use, but store the data in NVARCHAR instead of VARCHAR.