Why does SQL convert Unicode 9619 to ASCII code 166?
SQL Server is not employing any special custom logic here; it is using standard operating system services to perform the conversion.
Specifically, the SQL Server type and expression service (sqlTsEs
) calls into OS routine WideCharToMultiByte
in kernel32.dll
. SQL Server sets the input parameters to WideCharToMultiByte
such that the routine performs a 'quick translation'. This is faster than requesting a specific default character be used when no direct translation exists.
The quick translation relies on the target code page to perform a best-fit mapping for any unmatched characters, as mentioned in the link Martin Smith provided in a comment to the question:
Best-fit strategies vary for different code pages, and they are not documented in detail.
When the input parameters are set for a quick translation, WideCharToMultiByte
calls OS service GetMBNoDefault
(source). Inspecting the SQL Server call stack when performing the conversion specified in the question confirms this:
You need to use NCHAR(1 - 4000)
or NVARCHAR
, either as NVARCHAR(1 - 4000)
or NVARCHAR(MAX)
for storing anywhere from 4001 to just over 1,073,741,822 characters (or possibly less if storing any supplementary characters as described below).
Technically, you can store Japanese characters in VARCHAR
fields if you use a Japanese_*
Collation that is associated with Code Page 932. However, that is considered a "legacy" approach and would still leave you with some issues. The appropriate way to handle this is to use a Unicode datatype as mentioned above. Please see the UPDATE section at the end for details about VARCHAR
.
You will also want to specify a Japanese collation so that the data compares and sorts as expected. You can find the available Japanese collations using:
SELECT * FROM fn_helpcollations() WHERE name LIKE N'Japanese%';
And you use that value in the field specification like:
CREATE TABLE dbo.test
(
JapaneseText NVARCHAR(3000) COLLATE Japanese_CI_AS_KS_WS
);
Please see the following section of MSDN pages for more info on using collation and what each of the CI
/CS
, AS
/AI
, KS
, and WS
mean, as well as BIN
/BIN2
and SC
(not shown above): Collation.
And depending on which characters you need to store, you might need to pay close attention to the collations ending with SC
(i.e. "Supplementary Characters"). By default, NCHAR
/ NVARCHAR
data is stored as UCS-2
, which is very similar to UTF-16
, but UCS-2
is always 2-bytes per character. On the other hand, UTF-16
, in order to support more than 65,536 characters (max size of 2 bytes, or UInt16.MaxValue
+ 1) can store characters that are 4 bytes (known as "surrogate pairs"). Please see the following MSDN page on Collation and Unicode Support ("Supplementary Characters" section) for more details.
Absolutely do not use NTEXT
. That has been deprecated since SQL Server 2005 came out! There is no benefit / reason for using it and, in fact, there are several drawbacks.
UPDATE
While not ideal, it is possible to store Japanese characters in CHAR
/ VARCHAR
fields and variables. Doing so requires that the Database's default Collation be set to one that is associated with Code Page 932 (Shift-JIS). You can find that list of Collations by running the following query:
SELECT col.name
FROM sys.fn_helpcollations() col
WHERE COLLATIONPROPERTY(col.name, 'CodePage') = 932;
I did a simple test by creating a database with an entry from that list and ran the following statements:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation'); -- Japanese_Unicode_CI_AS
SELECT COLLATIONPROPERTY(N'Japanese_Unicode_CI_AS', 'CodePage'); -- 932
SELECT CONVERT(VARCHAR(50), 0x944094B294CD985198EE9AD79AA0); -- 如抜範浪偃壅國
SELECT LEN('如抜範浪偃壅國'), DATALENGTH('如抜範浪偃壅國'); -- 7, 14
This works because Code Page 932 is a Double-Byte Character Set (DBCS), which is different than UCS-2 / UTF-16 which is also double-byte. A DBCS character set is one that is double-byte within an 8-bit encoding (like the Extended ASCII Code Pages). You can see in that last query that the DATALENGTH
is twice the the character LEN
gth and that the data is in a VARCHAR
type since there is no N
prefix on the string literals and the CONVERT
was to VARCHAR
, not NVARCHAR
. There are 4 DBCS Code Pages supported in Windows / SQL Server:
- 932 = Japanese (Shift-JIS)
- 936 = Chinese Simplified (GB2312)
- 949 = Korean
- 950 = Chinese Traditional (Big5)
Only use these if you absolutely need to, such as supporting interaction with a legacy system. Of course, the Collations are still fine to use, but store the data in NVARCHAR
instead of VARCHAR
.
Best Answer
No, there is no way to "fix" the data because the data is no longer there. When you converted to
VARCHAR
, the underlying values for each character were changed into the ASCII value for?
. This is not a display issue, those characters are now physically a regular question mark. You will need to do a restore from a backup, unfortunately.The following example code shows that once a Unicode character is converted to
VARCHAR
(assuming the Code Page denoted by the Collation does not support that character), that it becomes a regular 'ol question mark, and forever shall it remain as such:The following example shows an instance of a Unicode character that is highly doubtful (at least at this time) to be supported in most fonts, hence it appears as a square box, but the
UNICODE
built-n function shows that the underlying code is still the correct Unicode Code Point:The actual character can be seen here: Cherokee Small Letter YA U+ABBF. This is a display issue, and many characters that are not represented in various fonts will display in the same manner without altering the actual value of the character, but they are still distinct characters.