Sql-server – Automatic Translation when Converting Unicode to non-Unicode / NVARCHAR to VARCHAR

collationencodingsql serverunicode

Unicode code point 9619 is a character called "Dark shade": ▓ (http://unicode-table.com/en/search/?q=9619).

Using the SQL_Latin1_General_CP1_CI_AS collation and 1252 code page, I would expect that casting / converting that Unicode character to non-Unicode data type would result in a question mark (?) as code page 1252 does not appear to contain this character and this appears to be SQL Server's behavior when conversion can not take place.

So my question is: why does SQL Server convert this character to an ASCII code 166 which is "Pipe, Broken vertical bar": ¦ ?

SELECT NCHAR(9619), CAST(NCHAR(9619) AS CHAR(1)), ASCII(CAST(NCHAR(9619) AS CHAR(1)))

Best Answer

Why does SQL convert Unicode 9619 to ASCII code 166?

SQL Server is not employing any special custom logic here; it is using standard operating system services to perform the conversion.

Specifically, the SQL Server type and expression service (sqlTsEs) calls into OS routine WideCharToMultiByte in kernel32.dll. SQL Server sets the input parameters to WideCharToMultiByte such that the routine performs a 'quick translation'. This is faster than requesting a specific default character be used when no direct translation exists.

The quick translation relies on the target code page to perform a best-fit mapping for any unmatched characters, as mentioned in the link Martin Smith provided in a comment to the question:

Best-fit strategies vary for different code pages, and they are not documented in detail.

When the input parameters are set for a quick translation, WideCharToMultiByte calls OS service GetMBNoDefault (source). Inspecting the SQL Server call stack when performing the conversion specified in the question confirms this:

SQL Server stack trace

Related Solutions

Sql-server – Why does MS SQL Server return a result for empty string check when Unicode string is not empty

~~While I am not sure of the exact reason for those specific characters, t~~ The issue has to do with the older collations ^{(please see UPDATE section at the end)}. And it is not just empty string that they equate to, but also to just one of those characters:

SELECT * FROM (SELECT N'ግዜ') tab(col) WHERE tab.col = N'ግ';

And if you try a case-sensitive collation, even with multiple characters, they still equate:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግግግግ' COLLATE SQL_Latin1_General_CP1_CS_AS;
SELECT * FROM (SELECT N'ግ') t(c) WHERE t.c = N'ዜዜዜዜ' COLLATE SQL_Latin1_General_CP1_CS_AS;
SELECT * FROM (SELECT N'ዜ') t(c) WHERE t.c = N'ግግግግ' COLLATE SQL_Latin1_General_CP1_CS_AS;

Even the "equivalent" Windows Collations have the same issue:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_CS_AS;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_CS_AS_KS_WS;

BUT, it seems that newer versions of the Windows Collations (i.e. the 100 series or newer) "fixes" the issue and these no longer equate:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_CI_AI;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_CI_AS;

And, of course, the binary Windows Collations (both older and newer series) work just fine as the following do not report a match:

SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_BIN;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_BIN2;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_BIN;
SELECT * FROM (SELECT N'ግዜ') t(c) WHERE t.c = N'ግ' COLLATE Latin1_General_100_BIN2;

UPDATE (2015-08-20)
After 6 hours of pouring through documentation on http://www.unicode.org/, http://site.icu-project.org/, and a couple of other Unicode-related sites, I gave up trying to find evidence of a "weighting" change that might have occurred just prior to 2008 (the new 100 series of collations were introduced in SQL Server 2008). I did, however, find the following info at www.fileformat.info for the two characters being tested here:

So, I moved on to the next project and moments later came across the following on the SQL Server 2008 MSDN page for Collation and Unicode Support:

SQL Server 2008 has introduced new collations that are in full alignment with collations that Windows Server 2008 provides. These 80 new collations are denoted by *_100 version references. They provide users with the most up-to-date and linguistically accurate cultural sorting conventions. Support includes the following:

...

Weighting has been added to previously non-weighted characters that would have compared equally.

No sort weight for a character means that it is effectively invisible.

Moral of the story: don't try so hard; give up sooner ;-)

UPDATE (2018-09-20)

For a more visual indication of what is going on, the query below compares each BMP character (Code Points 0 - 65535 / U+0000 - U+FFFF) to an empty string. The comparison is repeated using different collations: BIN2, a SQL Server Collations, Latin1_General that started with SQL Server 2000, Latin1_General that started with SQL Server 2008, Japanese_XJIS that started with SQL Server 2008, and Japanese_XJIS that started with SQL Server 2017. The two Collations starting in SQL Server 2008 are showing that both return the same number of matches, yet the newer Japanese_XJIS Collation returns a different number (the only Collations updated in SQL Server 2017 are the Japanese Collations). This is done to show how many characters are missing sort weights across the various Collation versions.

;WITH nums AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [CodePoint]
  FROM   [master].[sys].[columns] col
  CROSS JOIN [master].[sys].[objects] obj
)
SELECT nums.[CodePoint],
       COALESCE(NCHAR(nums.[CodePoint]), N'TOTALS:') AS [Character],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Latin1_General_BIN2) THEN 1 END) AS [BIN2],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE SQL_Latin1_General_CP1_CS_AS) THEN 1 END) AS [SQL Collations],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Latin1_General_CS_AS_KS_WS) THEN 1 END) AS [SQL2000 Latin1],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Latin1_General_100_CS_AS_KS_WS) THEN 1 END) AS [SQL2008 Latin1],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Japanese_XJIS_100_CS_AS_KS_WS) THEN 1 END) AS [SQL2008 Japanese],
       COUNT(CASE WHEN (NCHAR(nums.[CodePoint]) = N''
                 COLLATE Japanese_XJIS_140_CS_AS_KS_WS) THEN 1 END) AS [SQL2017 Japanese]
FROM   nums
GROUP BY ROLLUP ((nums.[CodePoint], NCHAR(nums.[CodePoint])));

To see the details for all rows, execute the query above. But for just the summary, that is:

BIN2  SQL Collations SQL2000 Latin1  SQL2008 Latin1  SQL2008 Japanese  SQL2017 Japanese
1     21230          21229           5840            5840              3375

Sql-server – How to efficiently downsize some Unicode fields

Regarding:

I tested doing an “add new column, set new=old, drop old, rename old to new” for both of the main two fields I want to convert from nvarchar(max) to varchar(max), and it took 81 minutes on our test server ... before running out of disk space ... it was too slow.

and

Another technique I plan to try is to create the new table def in a new schema, select into it from the old table, then move tables amongst schema and delete the old table.

Generally speaking, making a copy of the table with the ideal schema is my preferred approach. But, if you only now have maybe enough space to convert the two columns, are you sure you have enough space to make a copy of the entire table?

Also, the new table would just need to have a different name. It wouldn't need to be in a different schema.

Since you are on Enterprise Edition, have you considered or even looked at enabling Data Compression? Not only would it have the effect you are looking for on the NCHAR / NVARCHAR fields, but it would also save space on other fields with other types as well.

There are two types of Compression: Row and Page. You should read up on them and run the stored procedure that estimates what your savings would be.

Enabling compression can be done as an ONLINE operation, but might require a bit of disk space. If you do not have the available space for this, then you might could consider a hybrid approach where you build a copy of the table as TableNEW, and have the clustered index already created, and created with Compression enabled. Then you should be able to slowly fill TableNEW and the data will compress as it goes in. Of course, you will want to use INSERT INTO...SELECT to do it in batches. And the full benefit of the compression might not be realized until after you drop the original table and do a full rebuild on the Clutered Index of TableNEW.

Main MSDN documentation for Data Compression.
Enable Compression on a Table or Index
sp_estimate_data_compression_savings (mentioned in the "Enable Compression" document listed directly above)

And keep in mind that there are scenarios where you might not save that much space, or saving the space isn't worth the increase in CPU activity. But, that all depends on a lot of factors so it is really something that should be tested on your system.

You could always take the approach of:

Enable compression, either directly to the current table as ONLINE (if there is enough space to support it), or into a separate table that has Compression enabled.
IF you find that CPU actually has increased beyond the benefit of the space savings, then you have the option of building the table again with regular VARCHAR fields and no Compression. Because compression was already enabled, you should definitely have room for this now.

But again, like anything we do, it should be tested. I have heard for years how horrible "XML parsing" is on CPU, and how bad Compression is supposed to be, but in practice, those concerns are quite often overstated. The only way to know is to test on your system. (ps, just in case it is not clear being a text-only medium, these final statements are not attacking what @Kin said in his answer about needing to be cautious of CPU activity increase. He is correct, at least to a degree. I am just reminding everyone to be sure to put everything into perspective of current hardware and software and system setup).

Best Answer

Related Solutions

Sql-server – Why does MS SQL Server return a result for empty string check when Unicode string is not empty

Sql-server – How to efficiently downsize some Unicode fields

Related Question