Sql-server – SQL Server 2016 Collation and Internal table Glitch

collationsql serversql-server-2016

My Steps:

installed SQL 2016 with collation SQL_Latin1_General_CP850_CI_AS
Restored Database to Stage instances with same collation as Database.(original Backup from 2012)
Ran > "C:\Program Files\Microsoft SQL Server\MSSQL13.STAGE\MSSQL\Binn\sqlservr.exe" -m -T4022 -T3659 -s"STAGE" -q"SQL_Latin1_General_CP1_CI_AS"
this was to Changed the Instances and all database collation to SQL_Latin1_General_CP1_CI_AS. (what doesn't get changed User Type Tables)

Run Query

Select i.name, c.collation_name,i.internal_type,i.internal_type_desc
from sys.columns c inner join sys.internal_tables i on c.object_id=i.object_id
WHERE c.collation_name <> 'SQL_Latin1_General_CP1_CI_AS'
AND object_name(c.object_id) NOT LIKE 'sys%'
AND object_name(c.object_id) NOT LIKE 'queue%'
AND object_name(c.object_id) NOT LIKE 'file%'
AND object_name(c.object_id) NOT LIKE 'spt%'
AND object_name(c.object_id) NOT LIKE 'MSrep%'

Will return the internal tables with Old collation.

name         collation_name                  internal_type internal_type_desc
sqlagent_jobs SQL_Latin1_General_CP850_CI_AS   220         CONTAINED_FEATURES

Test 2:

Create New database on 2016 when the Instances collation is set to SQL_Latin1_General_CP850_CI_AS.
new database collation has SQL_Latin1_General_CP850_CI_AS.

Run this Query where collation = SQL_Latin1_General_CP1_CI_AS

Select i.name, c.collation_name,i.internal_type,i.internal_type_desc
from sys.columns c inner join sys.internal_tables i on c.object_id=i.object_id
WHERE c.collation_name = 'SQL_Latin1_General_CP1_CI_AS'
AND object_name(c.object_id) NOT LIKE 'sys%'
AND object_name(c.object_id) NOT LIKE 'queue%'
AND object_name(c.object_id) NOT LIKE 'file%'
AND object_name(c.object_id) NOT LIKE 'spt%'
AND object_name(c.object_id) NOT LIKE 'MSrep%'

this will return results that don't match the current collation.
name collation_name internal_type internal_type_desc
sqlagent_jobs SQL_Latin1_General_CP1_CI_AS 220 CONTAINED_FEATURES

I don't think the application will be affected by this internal_table.
But still strange that the internal_table collation does not match, on Creating and after converting the instances to a new collation.

If I do the same process but this time on a 2012 Staging instances.

restore Database to a 2012 Stagging instances with collation SQL_Latin1_General_CP850_CI_AS
Ran > "C:\Program Files\Microsoft SQL Server\MSSQL11.STAGE12\MSSQL\Binn\sqlservr.exe" -m -T4022 -T3659 -s"STAGE" -q"SQL_Latin1_General_CP1_CI_AS"
fix User Type Tables
Backup converted database.
restore database on SQL 2016 with collation SQL_Latin1_General_CP1_CI_AS.

Run Query

Select i.name, c.collation_name,i.internal_type,i.internal_type_desc
from sys.columns c inner join sys.internal_tables i on c.object_id=i.object_id
WHERE c.collation_name <> 'SQL_Latin1_General_CP1_CI_AS'
AND object_name(c.object_id) NOT LIKE 'sys%'
AND object_name(c.object_id) NOT LIKE 'queue%'
AND object_name(c.object_id) NOT LIKE 'file%'
AND object_name(c.object_id) NOT LIKE 'spt%'
AND object_name(c.object_id) NOT LIKE 'MSrep%'

Nothing returns because I converted it in 2012.

If anyone can explain what is going? Is this a bug?
Anyone know how to fix the internal table collation?

Best Answer

I have looked at the internal table columns collations across most of my test instances and they are all either SQL_Latin1_General_CP1_CI_AS or Latin1_General_BIN. Some instances are SQL Server Express LocalDB, which are always SQL_Latin1_General_CP1_CI_AS at the instance level (unfortunately), but I have a 2017 Express installed with Korean_100_CS_AS_KS_WS_SC at the instance-level and a 2019 CTP 2.3 installed with UTF8_BIN2 at the instance-level. I have DBs restored into the 2019 instance (from CTP 2.2, not an earlier version as in your case) with collations of French_100_CI_AS_SC_UTF8 and French_100_CI_AS.

I have another SQL Server 2017 Express instance currently using Estonian_100_CS_AS_SC at the instance and DBs levels, and that instance has been changed using the -q option many times for testing said option. It even has a Contained DB (containment_type = PARTIAL), and all columns in Internal Tables are either SQL_Latin1_General_CP1_CI_AS or Latin1_General_BIN, no exceptions.

I would not worry about these columns. They are clearly all MS Shipped objects that desire to keep their specific collation. And you will never have a time when 100% of collations are what you set the instance to. There are collations used for meta-data that have their own static collation. There are collations coming from expressions in system catalog views and functions with explicit collations set, etc. You can use the following query to see all of the collations that are not what you set the instance and database to:

SELECT OBJECT_SCHEMA_NAME(col.[object_id]) AS [SchemaName],
       ISNULL(obj.[name], OBJECT_NAME(col.[object_id])) AS [ObjectName],
       col.[name] AS [ColumnName],
       col.[collation_name] AS [Collation],
       obj.[type_desc],
       ISNULL(obj.[is_ms_shipped],
              OBJECTPROPERTY(col.[object_id], 'IsMSShipped')) AS [IsMSShipped],
       col.[object_id]
FROM   sys.[all_columns] col
LEFT JOIN sys.[all_objects] obj
       ON obj.[object_id] = col.[object_id]
WHERE  col.[collation_name] IS NOT NULL
AND    col.[collation_name] NOT IN (
                         CONVERT([sysname], SERVERPROPERTY('collation')),
                         CONVERT([sysname], DATABASEPROPERTYEX(DB_NAME(), 'collation')))
ORDER BY [SchemaName], [ObjectName], [ColumnName];

Some objects are not in sys.all_objects, such as sys.pdw_nodes_partitions, which is the same object_id everywhere (across instances and even versions of SQL Server, but did not exist in SQL Server 2012):

SELECT OBJECT_SCHEMA_NAME(-1046288262) AS [SchemaName],
       OBJECT_NAME(-1046288262) AS [ObjectName],
       OBJECTPROPERTY(-1046288262, 'IsExecuted') AS [IsExecuted],
       OBJECTPROPERTY(-1046288262, 'IsView') AS [IsView],
       OBJECTPROPERTY(-1046288262, 'IsMSShipped') AS [IsMSShipped];

Still, I am unsure how to explain the results you saw for Test 1. You said:

Will return the internal tables with Old collation.

Based on everything I found across all of those versions, etc in the top part of this answer, it should not ever be possible to get a column in sqlagent_jobs to be anything other than SQL_Latin1_General_CP1_CI_AS. This could be a fluke with the database upgrade process when creating those internal tables. The "CONTAINED_FEATURES" tables did not exist in SQL Server 2012, so when you restored into SQL Server 2016, they had to be created at that point. And your 2016 collation is SQL_Latin1_General_CP850_CI_AS, which is the only connection I see so far.

I WAS ABLE TO REPRODUCE THIS!! I backed up a DB from SQL Server 2012 LocalDB that had a collation of Latin1_General_100_CS_AS_KS_SC. I then restored that into SQL Server 2019 (the instance-level collation being UTF8_BIN2). Checking the internal tables I saw that the 3 queue_messages_* tables and the filestream_tombstone_2073058421 table (all of which existed in SQL Server 2012) still had a collation of Latin1_General_BIN. BUT, the sqlagent_* tables and plan_persist_* tables (none of which existed in SQL Server 2012), all had a collation of Latin1_General_100_CS_AS_KS_SC (same as the DB that was restored).

I then backed up a DB from SQL Server 2014 LocalDB that had a collation of Latin1_General_100_CS_AS_KS_WS_SC. I then restored that into SQL Server 2019 (the instance-level collation being UTF8_BIN2). Checking the internal tables I saw that the 3 queue_messages_* tables and the filestream_tombstone_2073058421 table (all of which existed in SQL Server 2012) still had a collation of Latin1_General_BIN. BUT, this time the sqlagent_* tables (which did exist in SQL Server 2014), all had a collation of SQL_Latin1_General_CP1_CI_AS. The plan_persist_* tables (only 2 of the 6 existed in SQL Server 2014), all had a collation of Latin1_General_100_CS_AS_KS_WS_SC (again, same as the DB that was restored).

I changed the compatibility mode of the DB that was restored from SQL Server 2012 to be "SQL Server 2019 (150)", but that did not fix the collation issue.

This is clearly a problem with the Database Upgrade process, and has nothing to do with the sqlservr.exe -q method of changing all collations on a server (which is officially undocumented and unsupported, but I did document it here: Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?). I have reported this bug to Microsoft via the barely functional, substandard-even-by-2008-2010-standards reporting site, UserVoice: Restoring databases into newer versions of SQL Server will create missing Internal Tables with incorrect collation.

Regarding Test 2:

collation = SQL_Latin1_General_CP1_CI_AS

Yes, it is to be expected that those internal tables will use that old, should-have-been-removed-by-now, collation.

Regarding Test 3:

Nothing returns because I converted it in 2012.

Not true. Nothing returns because:

You filtered out the 3 queue tables via AND object_name(c.object_id) NOT LIKE 'queue%', and those use Latin1_General_BIN.
You filtered out all of the other tables via WHERE c.collation_name <> 'SQL_Latin1_General_CP1_CI_AS'.

Related Solutions

Sql-server – Does Detach/Attach or Offline/Online Clear the Buffer Cache for a Particular Database

I initially thought you were on to something here. Working assumption was along the lines that perhaps the buffer pool wasn't immediately flushed as it requires "some work" to do so and why bother until the memory was required. But...

Your test is flawed.

What you're seeing in the buffer pool is the pages read as a result of re-attaching the database, not the remains of the previous instance of the database.

And we can see that the buffer pool was not totally blown away by the detach/attach. Seems like my buddy was wrong. Does anyone disagree or have a better argument?

Yes. You're interpreting physical reads 0 as meaning there were not any physical reads

Table 'DatabaseLog'. Scan count 1, logical reads 782, physical reads 0, read-ahead reads 768, lob logical reads 94, lob physical reads 4, lob read-ahead reads 24.

As described on Craig Freedman's blog the sequential read ahead mechanism tries to ensure that pages are in memory before they're requested by the query processor, which is why you see zero or a lower than expected physical read count reported.

When SQL Server performs a sequential scan of a large table, the storage engine initiates the read ahead mechanism to ensure that pages are in memory and ready to scan before they are needed by the query processor. The read ahead mechanism tries to stay 500 pages ahead of the scan.

None of the pages required to satisfy your query were in memory until read-ahead put them there.

As to why online/offline results in a different buffer pool profile warrants a little more idle investigation. @MarkSRasmussen might be able to help us out with that next time he visits.

Sql-server – How to set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character

The UCS-2 encoding is always 2 bytes per character and has a range of 0 - 65535 (0x0000 - 0xFFFF). UTF-16 (regardless of Big Endian or Little Endian) has a range of 0 - 1114111 (0x0000 - 0x10FFFF). The 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is 2 bytes per character while the range above 65536 / 0xFFFF is 4 bytes per character.

Windows and SQL Server started out using the UCS-2 encoding because it was available and UTF-16 hadn't been finalized yet. Fortunately, however, there was enough fore-thought put into the designs of UCS-2 and UTF-16 that the UCS-2 mappings are a complete subset of the UTF-16 mappings (meaning: the 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is UCS-2). AND, the 65536 - 1114111 (0x10000 - 0x10FFFF) range of UTF-16 is constructed from two Code Points in the UCS-2 range (ranges 0xD800 – 0xDBFF and 0xDC00 – 0xDFFF, specifically) that were reserved for this purpose and otherwise have no meaning. This combination of two Code Points is known as a Surrogate Pair, and Surrogate Pairs represent characters beyond the UCS-2 range which are known as Supplementary Characters.

All of that information explains two aspects of NVARCHAR / Unicode data in SQL Server:

Several built-in functions (not just NCHAR()) don't handle Surrogate Pairs / Supplementary Characters when not using a Supplementary Character-Aware Collation (SCA; i.e. one with _SC, or _140_ but not _BIN* in the name) because the non-SCA Collations (especially the SQL_ Collations) were originally implemented prior to UTF-16 being completed (sometime in 2000, I believe). The non-SQL_ Collations that have _90_ or _100_ in their names but not _SC have minimal support for Supplementary Characters in terms of comparison and sorting.
The full Unicode / UTF-16 character set can be stored, without any data loss, in the NVARCHAR / NCHAR / XML / NTEXT datatypes because UCS-2 and UTF-16 are the exact same byte sequences. The only difference is that UTF-16 makes use of the surrogate code points to construct Surrogate Pairs, and UCS-2 simply can't map them to any characters, hence they appear to the built-in functions as two unknown characters.

With that background info in mind, we can now go through the specific questions:

I would like SELECT NCHAR(128512); to return the same as this: SELECT N'?';

That can only happen if the current database — where the query is being executed — has a default Collation that is Supplementary Character-Aware, and those were introduced in SQL Server 2012. Built-in functions that have string input parameters can have the Collation provided inline via the COLLATE clause (i.e. LEN(N'string' COLLATE Some_Collation_SC) ) and don't need to be executed within a Database that has an SCA default Collation. However, built-in functions such as NCHAR() accept an INT input parameter and the COLLATE clause isn't valid in that context (which is why NCHAR() only supports Supplementary Characters when the current database has a default collation that is Supplementary Character-Aware; but this is an unnecessary inconvenience that can be changed, so please vote for my suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation) .

Is there an explanation for why, regardless of collation, SQL Server can understand and deal with the extended characters except from the perspective of NCHAR?

How SQL Server can store and retrieve Supplementary Characters without data loss was explained in the top section of this answer. But, it is not true that NCHAR is the only built-in function that has issues with Supplementary Characters (when not using an SCA Collation). For example, LEN(N'?' COLLATE SQL_Latin1_General_CP1_CI_AS) returns a value of 2 while LEN(N'?' COLLATE Latin1_General_100_CI_AS_SC) returns a value of 1.

If you go to the second link posted in the Question (i.e. "Microsoft's Supplementary Characters Collation Information") and scroll down just a little, you will see a chart of the built-in functions and how they behave based on the effective Collation.

How do I find a collation which has the "supplementary character" flag?

In a version of SQL Server prior to 2012 you can't. But, starting with SQL Server 2012, you can use the following query:

SELECT col.*
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]SC'
OR     col.[name] LIKE N'%[_]SC[_]%'
OR     (COLLATIONPROPERTY(col.[name], 'Version') = 3
      AND col.[name] NOT LIKE N'%[_]BIN%');

Your query was close, but the pattern started with SQL and the SQL Server Collations (i.e. those starting with SQL_ ) have been deprecated for a while in favor of the Windows Collations (those not starting with SQL_ ). So, the SQL_ Collations are not being updated and hence have no newer versions that would include the _SC option (and starting in SQL Server 2017, all new collations automatically support Supplementary Characters and do not need, or have, the _SC flag; and yes, the query shown immediately above accounts for that as well as picking up the _UTF8 collations added in SQL Server 2019).

Can you install collations on older instances?

No, you cannot install Collations into a previous version of SQL Server.

How can I set a Unicode string variable (e.g. nvarchar) to a Supplementary Character using code (without using the actual Supplementary Character) in a database where the collation "does not contain the supplementary character (SC) flag"?
...
Although the server is SQL Server 2008 R2, I'm also curious about any solutions for later versions.

When not using an SCA Collation, you can inject Code Points above 65535 / U+FFFF in two ways:

Specify the Surrogate Pair in terms of two calls to the NCHAR() function, each with one part of the pair
Specify the Surrogate Pair in terms of converting the VARBINARY form of the Little Endian (i.e. reversed) byte sequence.

These two methods of inserting Supplementary Characters / Surrogate Pairs will work even if the effective Collation is Supplementary Character-Aware, and should work just the same across all versions of SQL Server, at least as far back as 2005 (though probably would also work in SQL Server 2000 as well).

Example:

Character:
?
Name: Pile of Poo
Decimal: 128169
Code Point: U+1F4A9
Surrogate Pair: U+D83D & U+DF21

SELECT N'?', -- ?
       UNICODE(N'?' COLLATE Latin1_General_100_CI_AS), -- 55357
       UNICODE(N'?' COLLATE Latin1_General_100_CI_AS_SC), -- 128169
       NCHAR(128169), -- ? in DB with _SC Collation, else NULL
       NCHAR(0x1F4A9), -- ? in DB with _SC Collation, else NULL
       CONVERT(VARBINARY(4), 128169), -- 0x0001F4A9
       CONVERT(VARBINARY(4), N'?'), -- 0x3DD8A9DC
       CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- ? (regardless of DB Collation)
       NCHAR(0xD83D) + NCHAR(0xDCA9) -- ? (regardless of DB Collation)

UPDATE

You can use the following iTVF to get the Surrogate Pair values (in both INT and BINARY form) from any Code Point between 65536 - 1114111 (0x010000 - 0x10FFFF). And, while the input parameter is of type INT, you can pass in the binary / hex form of the Code Point and it will implicitly convert to the correct integer value.

CREATE FUNCTION dbo.GetSupplementaryCharacterInfo(@CodePoint INT)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN

WITH calc AS
(
  SELECT 55232 + (@CodePoint / 1024) AS [HighSurrogateINT],
         56320 + (@CodePoint % 1024) AS [LowSurrogateINT]
  WHERE  @CodePoint BETWEEN  65536 AND 1114111
)
SELECT @CodePoint AS [CodePointINT],
       HighSurrogateINT,
       LowSurrogateINT,
       CONVERT(VARBINARY(3), @CodePoint) AS [CodePointBIN],
       CONVERT(BINARY(2), HighSurrogateINT) AS [HighSurrogateBIN],
       CONVERT(BINARY(2), LowSurrogateINT) AS [LowSurrogateBIN],
       CONVERT(binary(4), NCHAR(HighSurrogateINT) + NCHAR(LowSurrogateINT)) AS [UTF-16LE],
       NCHAR(HighSurrogateINT) + NCHAR(LowSurrogateINT) AS [Character]
FROM   calc;
GO

Using the above function, the following two queries:

SELECT * FROM dbo.GetSupplementaryCharacterInfo(128169);

SELECT * FROM dbo.GetSupplementaryCharacterInfo(0x01F4A9);

both return the following:

CodePoint  HighSurrogate  LowSurrgate  CodePoint  HighSurrgate  LowSurrgate  UTF-16LE   Char
INT        INT            INT          BIN        BIN           BIN                     actr
128169     55357          56489        0x01F4A9   0xD83D        0xDCA9       0x3DD8A9DC   ?

UPDATE 2: An Even Better Update!

I have adapted the iTVF shown above to now return 188,657 code points so you don't need to fit it any particular value. Of course, being a TVF, you can add a WHERE clause to filter on a particular code point, or range of code points, or "similar characters", etc. And, it includes additional columns with pre-formatted escape sequences to construct each code point (both BMP and Supplementary Characters) in T-SQL (without requiring an "_SC" or "_140_" collation), HTML (and XML), the style common to many app languages ("\uHHHH" ; used for C++ / C# / F# / Java / JavaScript / Julia / etc), and finally the slightly newer, other common style that handles all code points, not just BMP ("\UHHHHHHHH" ; used for C / C++ / C# / F# / Julia / etc).

Read all about it here:

SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis ?)

Best Answer

Related Solutions

Sql-server – Does Detach/Attach or Offline/Online Clear the Buffer Cache for a Particular Database

Sql-server – How to set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character

?

Related Question