Sql-server – How to set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character

collationencodingsql servert-sqlunicode

I want to set a Unicode string variable to particular character based on its Unicode code point.

I want to use a code point beyond 65535, but the SQL Server 2008 R2 database has a collation of SQL_Latin1_General_CP1_CI_AS.

According to Microsoft's NCHAR documentation, the NCHAR function takes an integer as follows:

integer_expression

When the collation of the database does not contain the supplementary
character (SC) flag, this is a positive whole number from 0 through
65535 (0 through 0xFFFF). If a value outside this range is specified,
NULL is returned. For more information about supplementary characters,
see Collation and Unicode Support.

When the collation of the database supports the supplementary
character (SC) flag, this is a positive whole number from 0 through
1114111 (0 through 0x10FFFF). If a value outside this range is
specified, NULL is returned.

So this code:

SELECT NCHAR(128512);

Returns NULL in this database.

I would like it to return the same as this:

SELECT N'?';

How can I set a Unicode string variable (e.g. nvarchar) to an emoji using code (without using the actual emoji character) in a database where the collation "does not contain the supplementary character (SC) flag"?

Full list of emoji Unicode code points

(Ultimately I want any character to work. I just chose emoji for ease of reference.)

(Although the server is SQL Server 2008 R2, I'm also curious about any solutions for later versions.)

Assuming that there is no way, could I reference an inline user-defined function in another database which had an appropriate collation?

How do I find a collation which has the "supplementary character" flag?

This returns no records on our server:

SELECT * FROM sys.fn_helpcollations() 
WHERE name LIKE 'SQL%[_]SC';

It seems like SQL Server 2012 introduced Latin1_General_100_CI_AS_SC which would work. Can you install collations on older instances?

Collation References:

Is there an explanation for why, regardless of collation, SQL Server can understand and deal with the extended characters except from the perspective of NCHAR?

Best Answer

The UCS-2 encoding is always 2 bytes per character and has a range of 0 - 65535 (0x0000 - 0xFFFF). UTF-16 (regardless of Big Endian or Little Endian) has a range of 0 - 1114111 (0x0000 - 0x10FFFF). The 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is 2 bytes per character while the range above 65536 / 0xFFFF is 4 bytes per character.

Windows and SQL Server started out using the UCS-2 encoding because it was available and UTF-16 hadn't been finalized yet. Fortunately, however, there was enough fore-thought put into the designs of UCS-2 and UTF-16 that the UCS-2 mappings are a complete subset of the UTF-16 mappings (meaning: the 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is UCS-2). AND, the 65536 - 1114111 (0x10000 - 0x10FFFF) range of UTF-16 is constructed from two Code Points in the UCS-2 range (ranges 0xD800 – 0xDBFF and 0xDC00 – 0xDFFF, specifically) that were reserved for this purpose and otherwise have no meaning. This combination of two Code Points is known as a Surrogate Pair, and Surrogate Pairs represent characters beyond the UCS-2 range which are known as Supplementary Characters.

All of that information explains two aspects of NVARCHAR / Unicode data in SQL Server:

Several built-in functions (not just NCHAR()) don't handle Surrogate Pairs / Supplementary Characters when not using a Supplementary Character-Aware Collation (SCA; i.e. one with _SC, or _140_ but not _BIN* in the name) because the non-SCA Collations (especially the SQL_ Collations) were originally implemented prior to UTF-16 being completed (sometime in 2000, I believe). The non-SQL_ Collations that have _90_ or _100_ in their names but not _SC have minimal support for Supplementary Characters in terms of comparison and sorting.
The full Unicode / UTF-16 character set can be stored, without any data loss, in the NVARCHAR / NCHAR / XML / NTEXT datatypes because UCS-2 and UTF-16 are the exact same byte sequences. The only difference is that UTF-16 makes use of the surrogate code points to construct Surrogate Pairs, and UCS-2 simply can't map them to any characters, hence they appear to the built-in functions as two unknown characters.

With that background info in mind, we can now go through the specific questions:

I would like SELECT NCHAR(128512); to return the same as this: SELECT N'?';

That can only happen if the current database — where the query is being executed — has a default Collation that is Supplementary Character-Aware, and those were introduced in SQL Server 2012. Built-in functions that have string input parameters can have the Collation provided inline via the COLLATE clause (i.e. LEN(N'string' COLLATE Some_Collation_SC) ) and don't need to be executed within a Database that has an SCA default Collation. However, built-in functions such as NCHAR() accept an INT input parameter and the COLLATE clause isn't valid in that context (which is why NCHAR() only supports Supplementary Characters when the current database has a default collation that is Supplementary Character-Aware; but this is an unnecessary inconvenience that can be changed, so please vote for my suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation) .

Is there an explanation for why, regardless of collation, SQL Server can understand and deal with the extended characters except from the perspective of NCHAR?

How SQL Server can store and retrieve Supplementary Characters without data loss was explained in the top section of this answer. But, it is not true that NCHAR is the only built-in function that has issues with Supplementary Characters (when not using an SCA Collation). For example, LEN(N'?' COLLATE SQL_Latin1_General_CP1_CI_AS) returns a value of 2 while LEN(N'?' COLLATE Latin1_General_100_CI_AS_SC) returns a value of 1.

If you go to the second link posted in the Question (i.e. "Microsoft's Supplementary Characters Collation Information") and scroll down just a little, you will see a chart of the built-in functions and how they behave based on the effective Collation.

How do I find a collation which has the "supplementary character" flag?

In a version of SQL Server prior to 2012 you can't. But, starting with SQL Server 2012, you can use the following query:

SELECT col.*
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]SC'
OR     col.[name] LIKE N'%[_]SC[_]%'
OR     (COLLATIONPROPERTY(col.[name], 'Version') = 3
      AND col.[name] NOT LIKE N'%[_]BIN%');

Your query was close, but the pattern started with SQL and the SQL Server Collations (i.e. those starting with SQL_ ) have been deprecated for a while in favor of the Windows Collations (those not starting with SQL_ ). So, the SQL_ Collations are not being updated and hence have no newer versions that would include the _SC option (and starting in SQL Server 2017, all new collations automatically support Supplementary Characters and do not need, or have, the _SC flag; and yes, the query shown immediately above accounts for that as well as picking up the _UTF8 collations added in SQL Server 2019).

Can you install collations on older instances?

No, you cannot install Collations into a previous version of SQL Server.

How can I set a Unicode string variable (e.g. nvarchar) to a Supplementary Character using code (without using the actual Supplementary Character) in a database where the collation "does not contain the supplementary character (SC) flag"?
...
Although the server is SQL Server 2008 R2, I'm also curious about any solutions for later versions.

When not using an SCA Collation, you can inject Code Points above 65535 / U+FFFF in two ways:

Specify the Surrogate Pair in terms of two calls to the NCHAR() function, each with one part of the pair
Specify the Surrogate Pair in terms of converting the VARBINARY form of the Little Endian (i.e. reversed) byte sequence.

These two methods of inserting Supplementary Characters / Surrogate Pairs will work even if the effective Collation is Supplementary Character-Aware, and should work just the same across all versions of SQL Server, at least as far back as 2005 (though probably would also work in SQL Server 2000 as well).

Example:

Character:
?
Name: Pile of Poo
Decimal: 128169
Code Point: U+1F4A9
Surrogate Pair: U+D83D & U+DF21

SELECT N'?', -- ?
       UNICODE(N'?' COLLATE Latin1_General_100_CI_AS), -- 55357
       UNICODE(N'?' COLLATE Latin1_General_100_CI_AS_SC), -- 128169
       NCHAR(128169), -- ? in DB with _SC Collation, else NULL
       NCHAR(0x1F4A9), -- ? in DB with _SC Collation, else NULL
       CONVERT(VARBINARY(4), 128169), -- 0x0001F4A9
       CONVERT(VARBINARY(4), N'?'), -- 0x3DD8A9DC
       CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- ? (regardless of DB Collation)
       NCHAR(0xD83D) + NCHAR(0xDCA9) -- ? (regardless of DB Collation)

UPDATE

You can use the following iTVF to get the Surrogate Pair values (in both INT and BINARY form) from any Code Point between 65536 - 1114111 (0x010000 - 0x10FFFF). And, while the input parameter is of type INT, you can pass in the binary / hex form of the Code Point and it will implicitly convert to the correct integer value.

CREATE FUNCTION dbo.GetSupplementaryCharacterInfo(@CodePoint INT)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN

WITH calc AS
(
  SELECT 55232 + (@CodePoint / 1024) AS [HighSurrogateINT],
         56320 + (@CodePoint % 1024) AS [LowSurrogateINT]
  WHERE  @CodePoint BETWEEN  65536 AND 1114111
)
SELECT @CodePoint AS [CodePointINT],
       HighSurrogateINT,
       LowSurrogateINT,
       CONVERT(VARBINARY(3), @CodePoint) AS [CodePointBIN],
       CONVERT(BINARY(2), HighSurrogateINT) AS [HighSurrogateBIN],
       CONVERT(BINARY(2), LowSurrogateINT) AS [LowSurrogateBIN],
       CONVERT(binary(4), NCHAR(HighSurrogateINT) + NCHAR(LowSurrogateINT)) AS [UTF-16LE],
       NCHAR(HighSurrogateINT) + NCHAR(LowSurrogateINT) AS [Character]
FROM   calc;
GO

Using the above function, the following two queries:

SELECT * FROM dbo.GetSupplementaryCharacterInfo(128169);

SELECT * FROM dbo.GetSupplementaryCharacterInfo(0x01F4A9);

both return the following:

CodePoint  HighSurrogate  LowSurrgate  CodePoint  HighSurrgate  LowSurrgate  UTF-16LE   Char
INT        INT            INT          BIN        BIN           BIN                     actr
128169     55357          56489        0x01F4A9   0xD83D        0xDCA9       0x3DD8A9DC   ?

UPDATE 2: An Even Better Update!

I have adapted the iTVF shown above to now return 188,657 code points so you don't need to fit it any particular value. Of course, being a TVF, you can add a WHERE clause to filter on a particular code point, or range of code points, or "similar characters", etc. And, it includes additional columns with pre-formatted escape sequences to construct each code point (both BMP and Supplementary Characters) in T-SQL (without requiring an "_SC" or "_140_" collation), HTML (and XML), the style common to many app languages ("\uHHHH" ; used for C++ / C# / F# / Java / JavaScript / Julia / etc), and finally the slightly newer, other common style that handles all code points, not just BMP ("\UHHHHHHHH" ; used for C / C++ / C# / F# / Julia / etc).

Read all about it here:

SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis ?)

Best Answer

?

Related Solutions

Sql-server – the SQL Server equivalent of MySQL’s utf8mb4 character set

Sql-server – Automatic Translation when Converting Unicode to non-Unicode / NVARCHAR to VARCHAR

Related Question