All character data in SQL Server is associated with a collation, which determines the domain of characters that can be stored as well as the rules used to compare and sort data. Collation applies to both Unicode and non-Unicode data.

SQL Server includes three broad categories of collations: binary, legacy, and Windows. Binary collations (_BIN suffix) compare by the underlying code points, so an equality comparison returns not-equal whenever the code points differ, even if the characters look identical. Legacy (SQL_ prefix) and Windows collations provide sorting and comparison semantics that follow the more natural dictionary rules, allowing comparisons to consider case, accents, width, and Kana. Windows collations provide more robust word-sort rules that closely align with the Windows OS, whereas legacy collations consider only single characters.
The example below illustrates the difference between Windows and binary collations using the Teth character:
CREATE TABLE dbo.WindowsCollationExample
(
  Character1 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character2 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character3 nchar(1) COLLATE Arabic_100_CI_AS_SC
, Character4 nchar(1) COLLATE Arabic_100_CI_AS_SC
);

CREATE TABLE dbo.BinaryCollationExample
(
  Character1 nchar(1) COLLATE Arabic_100_BIN
, Character2 nchar(1) COLLATE Arabic_100_BIN
, Character3 nchar(1) COLLATE Arabic_100_BIN
, Character4 nchar(1) COLLATE Arabic_100_BIN
);

-- NCHAR(65217) through NCHAR(65220) are presentation forms of the same
-- letter (U+FEC1 - U+FEC4).
INSERT INTO dbo.BinaryCollationExample
VALUES ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );

INSERT INTO dbo.WindowsCollationExample
VALUES ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );

-- Binary collation: the code points differ, so every comparison
-- returns not-equal and no rows are returned.
SELECT *
FROM dbo.BinaryCollationExample
WHERE
    Character1 = Character2
    OR Character1 = Character3
    OR Character1 = Character4
    OR Character2 = Character3
    OR Character2 = Character4
    OR Character3 = Character4;

-- Windows collation: all four forms compare equal, so each query returns the row.
SELECT * FROM dbo.WindowsCollationExample WHERE Character1 = Character2;
SELECT * FROM dbo.WindowsCollationExample WHERE Character1 = Character3;
SELECT * FROM dbo.WindowsCollationExample WHERE Character1 = Character4;
SELECT * FROM dbo.WindowsCollationExample WHERE Character2 = Character3;
SELECT * FROM dbo.WindowsCollationExample WHERE Character2 = Character4;
SELECT * FROM dbo.WindowsCollationExample WHERE Character3 = Character4;
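To see why the binary collation treats these values as distinct, you can inspect the underlying code points with the built-in UNICODE function (a quick check, not part of the original example):

```sql
-- Four distinct code points for visually related presentation forms
SELECT UNICODE(NCHAR(65217)) AS [CodePoint1],  -- 65217 (U+FEC1)
       UNICODE(NCHAR(65218)) AS [CodePoint2],  -- 65218 (U+FEC2)
       UNICODE(NCHAR(65219)) AS [CodePoint3],  -- 65219 (U+FEC3)
       UNICODE(NCHAR(65220)) AS [CodePoint4];  -- 65220 (U+FEC4)
```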
Reasons why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. In summary, it may be for legacy compatibility, or because the characters are not canonically equivalent. Note that the Teth character ﻁ is used in different languages (http://en.wikipedia.org/wiki/Teth).
Best Answer
You need to use NCHAR(1 - 4000) or NVARCHAR, either as NVARCHAR(1 - 4000) or as NVARCHAR(MAX) for storing anywhere from 4001 to just over 1,073,741,822 characters (or possibly fewer if storing any supplementary characters, as described below).

Technically, you can store Japanese characters in VARCHAR fields if you use a Japanese_* Collation that is associated with Code Page 932. However, that is considered a "legacy" approach and would still leave you with some issues. The appropriate way to handle this is to use a Unicode datatype as mentioned above. Please see the UPDATE section at the end for details about VARCHAR.

You will also want to specify a Japanese collation so that the data compares and sorts as expected. You can find the available Japanese collations using:
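The original query is not shown above; a sketch along these lines, using the documented sys.fn_helpcollations() function, would list them:

```sql
-- List the Japanese collations available on this instance
SELECT [name], [description]
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE N'Japanese%';
```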
And you use that value in the field specification like:
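The original column definition is omitted above. As a sketch (the table and column names here are hypothetical; Japanese_XJIS_100_CI_AS is one of the collations returned by the query mentioned earlier):

```sql
CREATE TABLE dbo.ProductJapanese
(
  ProductName nvarchar(100) COLLATE Japanese_XJIS_100_CI_AS
);
```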
Please see the following MSDN page for more info on using collations and what each of the CI/CS, AS/AI, KS, and WS options mean, as well as BIN/BIN2 and SC (not shown above): Collation.

And depending on which characters you need to store, you might need to pay close attention to the collations ending with SC (i.e. "Supplementary Characters"). By default, NCHAR/NVARCHAR data is stored as UCS-2, which is very similar to UTF-16, but UCS-2 is always 2 bytes per character. UTF-16, on the other hand, in order to support more than 65,536 characters (the maximum 2-byte value, or UInt16.MaxValue + 1), can store characters as 4 bytes (known as "surrogate pairs"). Please see the "Supplementary Characters" section of the MSDN page on Collation and Unicode Support for more details.

Absolutely do not use NTEXT. It has been deprecated since SQL Server 2005 came out! There is no benefit / reason to use it and, in fact, there are several drawbacks.

UPDATE
While not ideal, it is possible to store Japanese characters in CHAR/VARCHAR fields and variables. Doing so requires that the database's default Collation be set to one that is associated with Code Page 932 (Shift-JIS). You can find that list of Collations by running the following query:

I did a simple test by creating a database with a Collation from that list and ran the following statements:
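The original queries are not shown above. As hedged sketches: the first uses the documented COLLATIONPROPERTY function to find Code Page 932 collations; the second mimics the described test (string literals without the N prefix, converted to VARCHAR rather than NVARCHAR), and only round-trips correctly in a database whose default Collation uses Code Page 932:

```sql
-- Collations associated with Code Page 932 (Shift-JIS)
SELECT [name], [description]
FROM   sys.fn_helpcollations()
WHERE  COLLATIONPROPERTY([name], 'CodePage') = 932;

-- With a Code Page 932 default Collation, VARCHAR (no N prefix on the
-- literal) can hold Japanese characters as double-byte sequences:
SELECT CONVERT(VARCHAR(50), 'こんにちは')             AS [Data],
       LEN(CONVERT(VARCHAR(50), 'こんにちは'))        AS [CharLen],
       DATALENGTH(CONVERT(VARCHAR(50), 'こんにちは')) AS [ByteLen];
```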
This works because Code Page 932 is a Double-Byte Character Set (DBCS), which is different from UCS-2 / UTF-16 even though those are also double-byte: a DBCS is double-byte within an 8-bit encoding (like the Extended ASCII Code Pages). You can see in that last query that the DATALENGTH is twice the LEN (character length), and that the data is in a VARCHAR type, since there is no N prefix on the string literals and the CONVERT was to VARCHAR, not NVARCHAR.

There are 4 DBCS Code Pages supported in Windows / SQL Server: 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean), and 950 (Chinese Traditional, Big5).

Only use these if you absolutely need to, such as when supporting interaction with a legacy system. Of course, the Collations are still fine to use, but store the data in NVARCHAR instead of VARCHAR.