Sql-server – Nvarchar to varchar to nvarchar characters conversion

alter-tablecollationencodingsql serverunicode

I had a table filled with some greek text rows as nvarchar(2000).

Lately, I altered the type of the column to varchar(4000) but I realized that some greek characters were shown as question marks.

So, I tried to change back to nvarchar(4000) in order to fix it as I suppose that the characters' unicode still remains the same.

I am just wondering, is there any way to fix this instead of restoring the backup I created before altering the table?

Best Answer

No, there is no way to "fix" the data because the data is no longer there. When you converted to VARCHAR, the underlying values for each character were changed into the ASCII value for ?. This is not a display issue, those characters are now physically a regular question mark. You will need to do a restore from a backup, unfortunately.

The following example code shows that once a Unicode character is converted to VARCHAR (assuming the Code Page denoted by the Collation does not support that character), that it becomes a regular 'ol question mark, and forever shall it remain as such:

DECLARE @Character NCHAR(1) = NCHAR(0x3525);
SELECT @Character AS [TheCharacter],
       UNICODE(@Character) AS [DecimalCodePoint],
       ASCII(@Character) AS [DecimalValueOfVarchar],
       UNICODE(CONVERT(VARCHAR(5), @Character)) AS [ConvertToVarchar], 
       UNICODE(CONVERT(NVARCHAR(5), CONVERT(VARCHAR(5), @Character)))
             AS [ConvertToVarcharAndBack], 
       ASCII('?') AS [VarcharQuestionMark],
       UNICODE(N'?') AS [UnicodeQuestionMark];

-- 㔥    13605   63  63  63  63  63

The following example shows an instance of a Unicode character that is highly doubtful (at least at this time) to be supported in most fonts, hence it appears as a square box, but the UNICODE built-n function shows that the underlying code is still the correct Unicode Code Point:

SELECT NCHAR(0xABBF), N'ꮿ', UNICODE(N'ꮿ');

-- ꮿ    ꮿ   43967

The actual character can be seen here: Cherokee Small Letter YA U+ABBF. This is a display issue, and many characters that are not represented in various fonts will display in the same manner without altering the actual value of the character, but they are still distinct characters.

Related Solutions

Sql-server – Automatic Translation when Converting Unicode to non-Unicode / NVARCHAR to VARCHAR

Why does SQL convert Unicode 9619 to ASCII code 166?

SQL Server is not employing any special custom logic here; it is using standard operating system services to perform the conversion.

Specifically, the SQL Server type and expression service (sqlTsEs) calls into OS routine WideCharToMultiByte in kernel32.dll. SQL Server sets the input parameters to WideCharToMultiByte such that the routine performs a 'quick translation'. This is faster than requesting a specific default character be used when no direct translation exists.

The quick translation relies on the target code page to perform a best-fit mapping for any unmatched characters, as mentioned in the link Martin Smith provided in a comment to the question:

Best-fit strategies vary for different code pages, and they are not documented in detail.

When the input parameters are set for a quick translation, WideCharToMultiByte calls OS service GetMBNoDefault (source). Inspecting the SQL Server call stack when performing the conversion specified in the question confirms this:

SQL Server stack trace

Sql-server – Storing Japanese characters in a table

You need to use NCHAR(1 - 4000) or NVARCHAR, either as NVARCHAR(1 - 4000) or NVARCHAR(MAX) for storing anywhere from 4001 to just over 1,073,741,822 characters (or possibly less if storing any supplementary characters as described below).

Technically, you can store Japanese characters in VARCHAR fields if you use a Japanese_* Collation that is associated with Code Page 932. However, that is considered a "legacy" approach and would still leave you with some issues. The appropriate way to handle this is to use a Unicode datatype as mentioned above. Please see the UPDATE section at the end for details about VARCHAR.

You will also want to specify a Japanese collation so that the data compares and sorts as expected. You can find the available Japanese collations using:

SELECT * FROM fn_helpcollations() WHERE name LIKE N'Japanese%';

And you use that value in the field specification like:

CREATE TABLE dbo.test
(
  JapaneseText NVARCHAR(3000) COLLATE Japanese_CI_AS_KS_WS
);

Please see the following section of MSDN pages for more info on using collation and what each of the CI/CS, AS/AI, KS, and WS mean, as well as BIN/BIN2 and SC (not shown above): Collation.

And depending on which characters you need to store, you might need to pay close attention to the collations ending with SC (i.e. "Supplementary Characters"). By default, NCHAR / NVARCHAR data is stored as UCS-2, which is very similar to UTF-16, but UCS-2 is always 2-bytes per character. On the other hand, UTF-16, in order to support more than 65,536 characters (max size of 2 bytes, or UInt16.MaxValue + 1) can store characters that are 4 bytes (known as "surrogate pairs"). Please see the following MSDN page on Collation and Unicode Support ("Supplementary Characters" section) for more details.

Absolutely do not use NTEXT. That has been deprecated since SQL Server 2005 came out! There is no benefit / reason for using it and, in fact, there are several drawbacks.

UPDATE

While not ideal, it is possible to store Japanese characters in CHAR / VARCHAR fields and variables. Doing so requires that the Database's default Collation be set to one that is associated with Code Page 932 (Shift-JIS). You can find that list of Collations by running the following query:

SELECT col.name
FROM   sys.fn_helpcollations() col
WHERE  COLLATIONPROPERTY(col.name, 'CodePage') = 932;

I did a simple test by creating a database with an entry from that list and ran the following statements:

SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation'); -- Japanese_Unicode_CI_AS
SELECT COLLATIONPROPERTY(N'Japanese_Unicode_CI_AS', 'CodePage'); -- 932
SELECT CONVERT(VARCHAR(50), 0x944094B294CD985198EE9AD79AA0); -- 如抜範浪偃壅國
SELECT LEN('如抜範浪偃壅國'), DATALENGTH('如抜範浪偃壅國'); -- 7, 14

This works because Code Page 932 is a Double-Byte Character Set (DBCS), which is different than UCS-2 / UTF-16 which is also double-byte. A DBCS character set is one that is double-byte within an 8-bit encoding (like the Extended ASCII Code Pages). You can see in that last query that the DATALENGTH is twice the the character LENgth and that the data is in a VARCHAR type since there is no N prefix on the string literals and the CONVERT was to VARCHAR, not NVARCHAR. There are 4 DBCS Code Pages supported in Windows / SQL Server:

932 = Japanese (Shift-JIS)
936 = Chinese Simplified (GB2312)
949 = Korean
950 = Chinese Traditional (Big5)

Only use these if you absolutely need to, such as supporting interaction with a legacy system. Of course, the Collations are still fine to use, but store the data in NVARCHAR instead of VARCHAR.

Best Answer

Related Solutions

Sql-server – Automatic Translation when Converting Unicode to non-Unicode / NVARCHAR to VARCHAR

Sql-server – Storing Japanese characters in a table

Related Question