I want to set a Unicode string variable to particular character based on its Unicode code point.
I want to use a code point beyond 65535, but the SQL Server 2008 R2 database has a collation of SQL_Latin1_General_CP1_CI_AS
.
According to Microsoft's NCHAR documentation, the NCHAR
function takes an integer as follows:
integer_expression
When the collation of the database does not contain the supplementary
character (SC) flag, this is a positive whole number from 0 through
65535 (0 through 0xFFFF). If a value outside this range is specified,
NULL is returned. For more information about supplementary characters,
see Collation and Unicode Support.When the collation of the database supports the supplementary
character (SC) flag, this is a positive whole number from 0 through
1114111 (0 through 0x10FFFF). If a value outside this range is
specified, NULL is returned.
So this code:
SELECT NCHAR(128512);
Returns NULL
in this database.
I would like it to return the same as this:
SELECT N'?';
How can I set a Unicode string variable (e.g. nvarchar) to an emoji using code (without using the actual emoji character) in a database where the collation "does not contain the supplementary character (SC) flag"?
Full list of emoji Unicode code points
(Ultimately I want any character to work. I just chose emoji for ease of reference.)
(Although the server is SQL Server 2008 R2, I'm also curious about any solutions for later versions.)
Assuming that there is no way, could I reference an inline user-defined function in another database which had an appropriate collation?
How do I find a collation which has the "supplementary character" flag?
This returns no records on our server:
SELECT * FROM sys.fn_helpcollations()
WHERE name LIKE 'SQL%[_]SC';
It seems like SQL Server 2012 introduced Latin1_General_100_CI_AS_SC
which would work. Can you install collations on older instances?
Collation References:
- Answer to What is the difference between char, nchar, varchar, and nvarchar in SQL Server?
- Microsoft's Supplementary Characters Collation Information
- Microsoft's SQL Server 2008 R2 Collation List
Is there an explanation for why, regardless of collation, SQL Server can understand and deal with the extended characters except from the perspective of NCHAR
?
Best Answer
The UCS-2 encoding is always 2 bytes per character and has a range of 0 - 65535 (0x0000 - 0xFFFF). UTF-16 (regardless of Big Endian or Little Endian) has a range of 0 - 1114111 (0x0000 - 0x10FFFF). The 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is 2 bytes per character while the range above 65536 / 0xFFFF is 4 bytes per character.
Windows and SQL Server started out using the UCS-2 encoding because it was available and UTF-16 hadn't been finalized yet. Fortunately, however, there was enough fore-thought put into the designs of UCS-2 and UTF-16 that the UCS-2 mappings are a complete subset of the UTF-16 mappings (meaning: the 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is UCS-2). AND, the 65536 - 1114111 (0x10000 - 0x10FFFF) range of UTF-16 is constructed from two Code Points in the UCS-2 range (ranges 0xD800 – 0xDBFF and 0xDC00 – 0xDFFF, specifically) that were reserved for this purpose and otherwise have no meaning. This combination of two Code Points is known as a Surrogate Pair, and Surrogate Pairs represent characters beyond the UCS-2 range which are known as Supplementary Characters.
All of that information explains two aspects of
NVARCHAR
/ Unicode data in SQL Server:NCHAR()
) don't handle Surrogate Pairs / Supplementary Characters when not using a Supplementary Character-Aware Collation (SCA; i.e. one with_SC
, or_140_
but not_BIN*
in the name) because the non-SCA Collations (especially theSQL_
Collations) were originally implemented prior to UTF-16 being completed (sometime in 2000, I believe). The non-SQL_
Collations that have_90_
or_100_
in their names but not_SC
have minimal support for Supplementary Characters in terms of comparison and sorting.NVARCHAR
/NCHAR
/XML
/NTEXT
datatypes because UCS-2 and UTF-16 are the exact same byte sequences. The only difference is that UTF-16 makes use of the surrogate code points to construct Surrogate Pairs, and UCS-2 simply can't map them to any characters, hence they appear to the built-in functions as two unknown characters.With that background info in mind, we can now go through the specific questions:
That can only happen if the current database — where the query is being executed — has a default Collation that is Supplementary Character-Aware, and those were introduced in SQL Server 2012. Built-in functions that have string input parameters can have the Collation provided inline via the
COLLATE
clause (i.e.LEN(N'string' COLLATE Some_Collation_SC)
) and don't need to be executed within a Database that has an SCA default Collation. However, built-in functions such asNCHAR()
accept anINT
input parameter and theCOLLATE
clause isn't valid in that context (which is whyNCHAR()
only supports Supplementary Characters when the current database has a default collation that is Supplementary Character-Aware; but this is an unnecessary inconvenience that can be changed, so please vote for my suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation) .How SQL Server can store and retrieve Supplementary Characters without data loss was explained in the top section of this answer. But, it is not true that
NCHAR
is the only built-in function that has issues with Supplementary Characters (when not using an SCA Collation). For example,LEN(N'?' COLLATE SQL_Latin1_General_CP1_CI_AS)
returns a value of 2 whileLEN(N'?' COLLATE Latin1_General_100_CI_AS_SC)
returns a value of 1.If you go to the second link posted in the Question (i.e. "Microsoft's Supplementary Characters Collation Information") and scroll down just a little, you will see a chart of the built-in functions and how they behave based on the effective Collation.
In a version of SQL Server prior to 2012 you can't. But, starting with SQL Server 2012, you can use the following query:
Your query was close, but the pattern started with
SQL
and the SQL Server Collations (i.e. those starting withSQL_
) have been deprecated for a while in favor of the Windows Collations (those not starting withSQL_
). So, theSQL_
Collations are not being updated and hence have no newer versions that would include the_SC
option (and starting in SQL Server 2017, all new collations automatically support Supplementary Characters and do not need, or have, the_SC
flag; and yes, the query shown immediately above accounts for that as well as picking up the_UTF8
collations added in SQL Server 2019).No, you cannot install Collations into a previous version of SQL Server.
When not using an SCA Collation, you can inject Code Points above 65535 / U+FFFF in two ways:
NCHAR()
function, each with one part of the pairVARBINARY
form of the Little Endian (i.e. reversed) byte sequence.These two methods of inserting Supplementary Characters / Surrogate Pairs will work even if the effective Collation is Supplementary Character-Aware, and should work just the same across all versions of SQL Server, at least as far back as 2005 (though probably would also work in SQL Server 2000 as well).
Example:
?
UPDATE
You can use the following iTVF to get the Surrogate Pair values (in both
INT
andBINARY
form) from any Code Point between 65536 - 1114111 (0x010000 - 0x10FFFF). And, while the input parameter is of typeINT
, you can pass in the binary / hex form of the Code Point and it will implicitly convert to the correct integer value.Using the above function, the following two queries:
both return the following:
UPDATE 2: An Even Better Update!
I have adapted the iTVF shown above to now return 188,657 code points so you don't need to fit it any particular value. Of course, being a TVF, you can add a
WHERE
clause to filter on a particular code point, or range of code points, or "similar characters", etc. And, it includes additional columns with pre-formatted escape sequences to construct each code point (both BMP and Supplementary Characters) in T-SQL (without requiring an "_SC
" or "_140_
" collation), HTML (and XML), the style common to many app languages ("\uHHHH" ; used for C++ / C# / F# / Java / JavaScript / Julia / etc), and finally the slightly newer, other common style that handles all code points, not just BMP ("\UHHHHHHHH" ; used for C / C++ / C# / F# / Julia / etc).Read all about it here:
SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis ?)