Sql-server – Storing Japanese characters in a table

collationsql serverunicode

I am using SQL Server 2008 R2 and I want to store Japanese characters in one of the columns of my table.

Say I want to store Japanese names, how can I achieve that? Is there an easy way?

Best Answer

You need to use NCHAR(1 - 4000) or NVARCHAR, either as NVARCHAR(1 - 4000) or NVARCHAR(MAX) for storing anywhere from 4001 to just over 1,073,741,822 characters (or possibly less if storing any supplementary characters as described below).

Technically, you can store Japanese characters in VARCHAR fields if you use a Japanese_* Collation that is associated with Code Page 932. However, that is considered a "legacy" approach and would still leave you with some issues. The appropriate way to handle this is to use a Unicode datatype as mentioned above. Please see the UPDATE section at the end for details about VARCHAR.

You will also want to specify a Japanese collation so that the data compares and sorts as expected. You can find the available Japanese collations using:

SELECT * FROM fn_helpcollations() WHERE name LIKE N'Japanese%';

And you use that value in the field specification like:

CREATE TABLE dbo.test
(
  JapaneseText NVARCHAR(3000) COLLATE Japanese_CI_AS_KS_WS
);

Please see the following section of MSDN pages for more info on using collation and what each of the CI/CS, AS/AI, KS, and WS mean, as well as BIN/BIN2 and SC (not shown above): Collation.

And depending on which characters you need to store, you might need to pay close attention to the collations ending with SC (i.e. "Supplementary Characters"). By default, NCHAR / NVARCHAR data is stored as UCS-2, which is very similar to UTF-16, but UCS-2 is always 2-bytes per character. On the other hand, UTF-16, in order to support more than 65,536 characters (max size of 2 bytes, or UInt16.MaxValue + 1) can store characters that are 4 bytes (known as "surrogate pairs"). Please see the following MSDN page on Collation and Unicode Support ("Supplementary Characters" section) for more details.

Absolutely do not use NTEXT. That has been deprecated since SQL Server 2005 came out! There is no benefit / reason for using it and, in fact, there are several drawbacks.

UPDATE

While not ideal, it is possible to store Japanese characters in CHAR / VARCHAR fields and variables. Doing so requires that the Database's default Collation be set to one that is associated with Code Page 932 (Shift-JIS). You can find that list of Collations by running the following query:

SELECT col.name
FROM   sys.fn_helpcollations() col
WHERE  COLLATIONPROPERTY(col.name, 'CodePage') = 932;

I did a simple test by creating a database with an entry from that list and ran the following statements:

SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation'); -- Japanese_Unicode_CI_AS
SELECT COLLATIONPROPERTY(N'Japanese_Unicode_CI_AS', 'CodePage'); -- 932
SELECT CONVERT(VARCHAR(50), 0x944094B294CD985198EE9AD79AA0); -- 如抜範浪偃壅國
SELECT LEN('如抜範浪偃壅國'), DATALENGTH('如抜範浪偃壅國'); -- 7, 14

This works because Code Page 932 is a Double-Byte Character Set (DBCS), which is different than UCS-2 / UTF-16 which is also double-byte. A DBCS character set is one that is double-byte within an 8-bit encoding (like the Extended ASCII Code Pages). You can see in that last query that the DATALENGTH is twice the the character LENgth and that the data is in a VARCHAR type since there is no N prefix on the string literals and the CONVERT was to VARCHAR, not NVARCHAR. There are 4 DBCS Code Pages supported in Windows / SQL Server:

932 = Japanese (Shift-JIS)
936 = Chinese Simplified (GB2312)
949 = Korean
950 = Chinese Traditional (Big5)

Only use these if you absolutely need to, such as supporting interaction with a legacy system. Of course, the Collations are still fine to use, but store the data in NVARCHAR instead of VARCHAR.

Related Solutions

Sql-server – How to store more than 15,000 Japanese characters in a column

In this case only NVARCHAR(MAX) should be used. It's able to store up to 2^31-1 bytes of Unicode data (about 2 GB).

Sql-server – Why these characters are all equal in SQL Server

All character data in SQL Server is associated with a collation, which determines the domain of characters that can be stored as well as the rules used to compare and sort data. Collation applies to both Unicode and Non-Unicode data.

SQL Server includes 3 broad categories of collations: binary, legacy, and Windows. Collations in the binary category (_BIN suffix) use the underlying code points to compare so equality comparisons return not-equal if the code points differ regardless of the character. Legacy (SQL_ prefix) and Windows collations provide sorting and comparison semantics for the more natural dictionary rules. This allows comparisons to consider case, accents, width, and Kana. Windows collations provide more robust word-sort rules that closely align with the Windows OS whereas legacy collations consider only single characters.

The example below illustrates the differences between Windows and binary collation with the Teth character:

CREATE TABLE dbo.WindowsColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character2 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character3 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character4 nchar(1) COLLATE Arabic_100_CI_AS_SC
    );

CREATE TABLE dbo.BinaryColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_BIN
    , Character2 nchar(1) COLLATE Arabic_100_BIN
    , Character3 nchar(1) COLLATE Arabic_100_BIN
    , Character4 nchar(1) COLLATE Arabic_100_BIN
    );

INSERT  INTO dbo.BinaryColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
INSERT  INTO dbo.WindowsColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );

--all characters compare not equal
SELECT *
FROM dbo.BinaryColationExample
WHERE
    character1 = character2
    OR character1 = character3
    OR character1 = character4
    OR character2 = character3
    OR character2 = character4
    OR character3 = character4;

--all characters compare equal
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character2;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character3 = character4;

Reasons as to why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. I summary, it may be for legacy compatibility or the characters are not canonically equivalent. Note that the Teth character ﻁ is used in different languages (http://en.wikipedia.org/wiki/Teth).

Best Answer

Related Solutions

Sql-server – How to store more than 15,000 Japanese characters in a column

Sql-server – Why these characters are all equal in SQL Server

Related Question