I initially thought you were on to something here. Working assumption was along the lines that perhaps the buffer pool wasn't immediately flushed as it requires "some work" to do so and why bother until the memory was required. But...
Your test is flawed.
What you're seeing in the buffer pool is the pages read as a result of re-attaching the database, not the remains of the previous instance of the database.
And we can see that the buffer pool was not totally blown away by the
detach/attach. Seems like my buddy was wrong. Does anyone disagree or
have a better argument?
Yes. You're interpreting "physical reads 0" as meaning there were no physical reads:
Table 'DatabaseLog'. Scan count 1, logical reads 782, physical reads
0, read-ahead reads 768, lob logical reads 94, lob physical reads 4,
lob read-ahead reads 24.
As described on Craig Freedman's blog, the sequential read-ahead mechanism tries to ensure that pages are in memory before they are requested by the query processor, which is why you see zero, or a lower than expected, physical read count reported.
When SQL Server performs a sequential scan of a large table, the
storage engine initiates the read ahead mechanism to ensure that pages
are in memory and ready to scan before they are needed by the query
processor. The read ahead mechanism tries to stay 500 pages ahead of
the scan.
None of the pages required to satisfy your query were in memory until read-ahead put them there.
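You can reproduce this yourself with a quick sketch (DatabaseLog is the AdventureWorks sample table from the statistics output above; DBCC DROPCLEANBUFFERS should only be run on a test system):

DBCC DROPCLEANBUFFERS; -- evict clean pages so the scan starts cold (test systems only!)
SET STATISTICS IO ON;
SELECT COUNT(*) FROM dbo.DatabaseLog; -- expect high read-ahead reads, few or zero physical reads
SET STATISTICS IO OFF;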
Why online/offline results in a different buffer pool profile warrants a little more idle investigation. @MarkSRasmussen might be able to help us out with that next time he visits.
The UCS-2 encoding is always 2 bytes per character and has a range of 0 - 65535 (0x0000 - 0xFFFF). UTF-16 (regardless of Big Endian or Little Endian) has a range of 0 - 1114111 (0x0000 - 0x10FFFF). The 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is 2 bytes per character while the range above 65536 / 0xFFFF is 4 bytes per character.
Windows and SQL Server started out using the UCS-2 encoding because it was available and UTF-16 hadn't been finalized yet. Fortunately, however, there was enough forethought put into the designs of UCS-2 and UTF-16 that the UCS-2 mappings are a complete subset of the UTF-16 mappings (meaning: the 0 - 65535 / 0x0000 - 0xFFFF range of UTF-16 is UCS-2). AND, the 65536 - 1114111 (0x10000 - 0x10FFFF) range of UTF-16 is constructed from two Code Points in the UCS-2 range (ranges 0xD800 – 0xDBFF and 0xDC00 – 0xDFFF, specifically) that were reserved for this purpose and otherwise have no meaning. This combination of two Code Points is known as a Surrogate Pair, and Surrogate Pairs represent characters beyond the UCS-2 range which are known as Supplementary Characters.
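As a quick sketch of that arithmetic, using T-SQL as a calculator (the constants are just the decimal forms of 0xD800, 0xDC00, 0x10000, and 0x400 from the ranges above):

DECLARE @CodePoint INT = 128169; -- U+1F4A9 (Pile of Poo, used as an example later in this answer)
SELECT 55296 + ((@CodePoint - 65536) / 1024) AS [HighSurrogate], -- 55357 = 0xD83D
       56320 + ((@CodePoint - 65536) % 1024) AS [LowSurrogate];  -- 56489 = 0xDCA9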
All of that information explains two aspects of NVARCHAR / Unicode data in SQL Server:
- Several built-in functions (not just NCHAR()) don't handle Surrogate Pairs / Supplementary Characters when not using a Supplementary Character-Aware Collation (SCA; i.e. one with _SC, or _140_ but not _BIN*, in the name) because the non-SCA Collations (especially the SQL_ Collations) were originally implemented prior to UTF-16 being completed (sometime in 2000, I believe). The non-SQL_ Collations that have _90_ or _100_ in their names but not _SC have minimal support for Supplementary Characters in terms of comparison and sorting.
- The full Unicode / UTF-16 character set can be stored, without any data loss, in the NVARCHAR / NCHAR / XML / NTEXT datatypes because UCS-2 and UTF-16 are the exact same byte sequences. The only difference is that UTF-16 makes use of the surrogate code points to construct Surrogate Pairs, while UCS-2 simply can't map them to any characters, hence they appear to the built-in functions as two unknown characters (the sketch below demonstrates the identical byte sequences).
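A minimal sketch demonstrating that the stored byte sequence is identical whether or not the effective Collation is Supplementary Character-Aware (the character and byte values match the Pile of Poo example later in this answer):

SELECT CONVERT(VARBINARY(8), N'💩' COLLATE SQL_Latin1_General_CP1_CI_AS), -- 0x3DD8A9DC
       CONVERT(VARBINARY(8), N'💩' COLLATE Latin1_General_100_CI_AS_SC);  -- 0x3DD8A9DC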
With that background info in mind, we can now go through the specific questions:
I would like SELECT NCHAR(128512); to return the same as this: SELECT N'😀';
That can only happen if the current database — where the query is being executed — has a default Collation that is Supplementary Character-Aware, and those were introduced in SQL Server 2012. Built-in functions that have string input parameters can have the Collation provided inline via the COLLATE clause (i.e. LEN(N'string' COLLATE Some_Collation_SC)) and don't need to be executed within a Database that has an SCA default Collation. However, built-in functions such as NCHAR() accept an INT input parameter, and the COLLATE clause isn't valid in that context (which is why NCHAR() only supports Supplementary Characters when the current database has a default Collation that is Supplementary Character-Aware; but this is an unnecessary inconvenience that can be changed, so please vote for my suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation).
Is there an explanation for why, regardless of collation, SQL Server can understand and deal with the extended characters except from the perspective of NCHAR?
How SQL Server can store and retrieve Supplementary Characters without data loss was explained in the top section of this answer. But it is not true that NCHAR is the only built-in function that has issues with Supplementary Characters (when not using an SCA Collation). For example, LEN(N'💩' COLLATE SQL_Latin1_General_CP1_CI_AS) returns a value of 2 while LEN(N'💩' COLLATE Latin1_General_100_CI_AS_SC) returns a value of 1.
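Relatedly, DATALENGTH reports the same 4 bytes under both Collations, confirming that the difference is in character counting, not in what is stored (a quick sketch):

SELECT DATALENGTH(N'💩' COLLATE SQL_Latin1_General_CP1_CI_AS) AS [Bytes_nonSCA], -- 4
       DATALENGTH(N'💩' COLLATE Latin1_General_100_CI_AS_SC)  AS [Bytes_SCA];   -- 4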
If you go to the second link posted in the Question (i.e. "Microsoft's Supplementary Characters Collation Information") and scroll down just a little, you will see a chart of the built-in functions and how they behave based on the effective Collation.
How do I find a collation which has the "supplementary character" flag?
In a version of SQL Server prior to 2012 you can't. But, starting with SQL Server 2012, you can use the following query:
SELECT col.*
FROM sys.fn_helpcollations() col
WHERE col.[name] LIKE N'%[_]SC'
OR col.[name] LIKE N'%[_]SC[_]%'
OR (COLLATIONPROPERTY(col.[name], 'Version') = 3
AND col.[name] NOT LIKE N'%[_]BIN%');
Your query was close, but the pattern started with SQL and the SQL Server Collations (i.e. those starting with SQL_) have been deprecated for a while in favor of the Windows Collations (those not starting with SQL_). So the SQL_ Collations are not being updated and hence have no newer versions that would include the _SC option (and starting in SQL Server 2017, all new collations automatically support Supplementary Characters and do not need, or have, the _SC flag; and yes, the query shown immediately above accounts for that as well as picking up the _UTF8 collations added in SQL Server 2019).
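If you just need to check one particular Collation, you can query COLLATIONPROPERTY directly (a sketch; 'Version' 3 corresponds to the SQL Server 2012 generation, the same test used in the query above):

SELECT COLLATIONPROPERTY(N'Latin1_General_100_CI_AS_SC', 'Version') AS [SCA],    -- 3
       COLLATIONPROPERTY(N'SQL_Latin1_General_CP1_CI_AS', 'Version') AS [nonSCA]; -- 0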
Can you install collations on older instances?
No, you cannot install Collations into a previous version of SQL Server.
How can I set a Unicode string variable (e.g. nvarchar) to a Supplementary Character using code (without using the actual Supplementary Character) in a database where the collation "does not contain the supplementary character (SC) flag"?
...
Although the server is SQL Server 2008 R2, I'm also curious about any solutions for later versions.
When not using an SCA Collation, you can inject Code Points above 65535 / U+FFFF in two ways:
- Specify the Surrogate Pair in terms of two calls to the NCHAR() function, each with one part of the pair.
- Specify the Surrogate Pair in terms of converting the VARBINARY form of the Little Endian (i.e. reversed) byte sequence.
These two methods of inserting Supplementary Characters / Surrogate Pairs will work even if the effective Collation is Supplementary Character-Aware, and should work just the same across all versions of SQL Server, at least as far back as 2005 (and probably in SQL Server 2000 as well).
Example:
- Character: 💩
- Name: Pile of Poo
- Decimal: 128169
- Code Point: U+1F4A9
- Surrogate Pair: U+D83D & U+DCA9
SELECT N'💩', -- 💩
UNICODE(N'💩' COLLATE Latin1_General_100_CI_AS), -- 55357
UNICODE(N'💩' COLLATE Latin1_General_100_CI_AS_SC), -- 128169
NCHAR(128169), -- 💩 in DB with _SC Collation, else NULL
NCHAR(0x1F4A9), -- 💩 in DB with _SC Collation, else NULL
CONVERT(VARBINARY(4), 128169), -- 0x0001F4A9
CONVERT(VARBINARY(4), N'💩'), -- 0x3DD8A9DC
CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- 💩 (regardless of DB Collation)
NCHAR(0xD83D) + NCHAR(0xDCA9) -- 💩 (regardless of DB Collation)
UPDATE
You can use the following iTVF to get the Surrogate Pair values (in both INT and BINARY form) from any Code Point between 65536 - 1114111 (0x010000 - 0x10FFFF). And, while the input parameter is of type INT, you can pass in the binary / hex form of the Code Point and it will implicitly convert to the correct integer value.
CREATE FUNCTION dbo.GetSupplementaryCharacterInfo(@CodePoint INT)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN

WITH calc AS
(
   -- 55232 = 0xD800 - (0x10000 / 0x400); 56320 = 0xDC00
   SELECT 55232 + (@CodePoint / 1024) AS [HighSurrogateINT],
          56320 + (@CodePoint % 1024) AS [LowSurrogateINT]
   WHERE  @CodePoint BETWEEN 65536 AND 1114111 -- Supplementary Character range only
)
SELECT @CodePoint AS [CodePointINT],
       HighSurrogateINT,
       LowSurrogateINT,
       CONVERT(VARBINARY(3), @CodePoint) AS [CodePointBIN],
       CONVERT(BINARY(2), HighSurrogateINT) AS [HighSurrogateBIN],
       CONVERT(BINARY(2), LowSurrogateINT) AS [LowSurrogateBIN],
       CONVERT(BINARY(4), NCHAR(HighSurrogateINT) + NCHAR(LowSurrogateINT)) AS [UTF-16LE],
       NCHAR(HighSurrogateINT) + NCHAR(LowSurrogateINT) AS [Character]
FROM   calc;
GO
Using the above function, the following two queries:
SELECT * FROM dbo.GetSupplementaryCharacterInfo(128169);
SELECT * FROM dbo.GetSupplementaryCharacterInfo(0x01F4A9);
both return the following:
CodePointINT  HighSurrogateINT  LowSurrogateINT  CodePointBIN  HighSurrogateBIN  LowSurrogateBIN  UTF-16LE    Character
128169        55357             56489            0x01F4A9      0xD83D            0xDCA9           0x3DD8A9DC  💩
UPDATE 2: An Even Better Update!
I have adapted the iTVF shown above to now return 188,657 code points, so you don't need to feed it any particular value. Of course, being a TVF, you can add a WHERE clause to filter on a particular code point, or range of code points, or "similar characters", etc. And, it includes additional columns with pre-formatted escape sequences to construct each code point (both BMP and Supplementary Characters) in T-SQL (without requiring an "_SC" or "_140_" collation), HTML (and XML), the style common to many app languages ("\uHHHH"; used for C++ / C# / F# / Java / JavaScript / Julia / etc.), and finally the slightly newer, other common style that handles all code points, not just BMP ("\UHHHHHHHH"; used for C / C++ / C# / F# / Julia / etc.).
Read all about it here:
SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis ?)
Best Answer
I have looked at the internal table columns' collations across most of my test instances and they are all either SQL_Latin1_General_CP1_CI_AS or Latin1_General_BIN. Some instances are SQL Server Express LocalDB, which are always SQL_Latin1_General_CP1_CI_AS at the instance level (unfortunately), but I have a 2017 Express installed with Korean_100_CS_AS_KS_WS_SC at the instance level and a 2019 CTP 2.3 installed with UTF8_BIN2 at the instance level. I have DBs restored into the 2019 instance (from CTP 2.2, not an earlier version as in your case) with collations of French_100_CI_AS_SC_UTF8 and French_100_CI_AS.

I have another SQL Server 2017 Express instance currently using Estonian_100_CS_AS_SC at the instance and DB levels, and that instance has been changed using the -q option many times for testing said option. It even has a Contained DB (containment_type = PARTIAL), and all columns in Internal Tables are either SQL_Latin1_General_CP1_CI_AS or Latin1_General_BIN, no exceptions.

I would not worry about these columns. They are clearly all MS Shipped objects that desire to keep their specific collation. And you will never have a time when 100% of collations are what you set the instance to. There are collations used for meta-data that have their own static collation. There are collations coming from expressions in system catalog views and functions with explicit collations set, etc. You can use the following query to see all of the collations that are not what you set the instance and database to:
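A minimal sketch of such a query (an approximation, comparing each column's collation against the current database's default):

SELECT OBJECT_NAME(col.[object_id]) AS [Object],
       col.[name] AS [Column],
       col.collation_name
FROM   sys.all_columns col
WHERE  col.collation_name IS NOT NULL
AND    col.collation_name <> CONVERT(sysname,
                             DATABASEPROPERTYEX(DB_NAME(), 'Collation'));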
Some objects are not in sys.all_objects, such as sys.pdw_nodes_partitions, which is the same object_id everywhere (across instances and even versions of SQL Server, but did not exist in SQL Server 2012):

Still, I am unsure how to explain the results you saw for Test 1. You said:
Based on everything I found across all of those versions, etc. in the top part of this answer, it should not ever be possible to get a column in sqlagent_jobs to be anything other than SQL_Latin1_General_CP1_CI_AS. This could be a fluke with the database upgrade process when creating those internal tables. The "CONTAINED_FEATURES" tables did not exist in SQL Server 2012, so when you restored into SQL Server 2016, they had to be created at that point. And your 2016 collation is SQL_Latin1_General_CP850_CI_AS, which is the only connection I see so far.

I WAS ABLE TO REPRODUCE THIS!! I backed up a DB from SQL Server 2012 LocalDB that had a collation of Latin1_General_100_CS_AS_KS_SC. I then restored that into SQL Server 2019 (the instance-level collation being UTF8_BIN2). Checking the internal tables, I saw that the 3 queue_messages_* tables and the filestream_tombstone_2073058421 table (all of which existed in SQL Server 2012) still had a collation of Latin1_General_BIN. BUT, the sqlagent_* tables and plan_persist_* tables (none of which existed in SQL Server 2012) all had a collation of Latin1_General_100_CS_AS_KS_SC (same as the DB that was restored).

I then backed up a DB from SQL Server 2014 LocalDB that had a collation of Latin1_General_100_CS_AS_KS_WS_SC. I then restored that into SQL Server 2019 (the instance-level collation being UTF8_BIN2). Checking the internal tables, I saw that the 3 queue_messages_* tables and the filestream_tombstone_2073058421 table (all of which existed in SQL Server 2012) still had a collation of Latin1_General_BIN. BUT, this time the sqlagent_* tables (which did exist in SQL Server 2014) all had a collation of SQL_Latin1_General_CP1_CI_AS. The plan_persist_* tables (only 2 of the 6 existed in SQL Server 2014) all had a collation of Latin1_General_100_CS_AS_KS_WS_SC (again, same as the DB that was restored).

I changed the compatibility mode of the DB that was restored from SQL Server 2012 to be "SQL Server 2019 (150)", but that did not fix the collation issue.

This is clearly a problem with the Database Upgrade process, and has nothing to do with the sqlservr.exe -q method of changing all collations on a server (which is officially undocumented and unsupported, but I did document it here: Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?). I have reported this bug to Microsoft via the barely functional, substandard-even-by-2008-2010-standards reporting site, UserVoice: Restoring databases into newer versions of SQL Server will create missing Internal Tables with incorrect collation.

Regarding Test 2:
Yes, it is to be expected that those internal tables will use that old, should-have-been-removed-by-now, collation.
Regarding Test 3:
Not true. Nothing returns because your query excludes both possibilities: the queue_messages_* tables use Latin1_General_BIN but are filtered out by AND object_name(c.object_id) NOT LIKE 'queue%', and every other internal table column is SQL_Latin1_General_CP1_CI_AS, which is filtered out by WHERE c.collation_name <> 'SQL_Latin1_General_CP1_CI_AS'.
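To see what actually is there, a sketch that simply drops both filters (assuming the test query read from sys.all_columns, as those two predicates suggest):

SELECT OBJECT_NAME(c.[object_id]) AS [Object],
       c.[name] AS [Column],
       c.collation_name
FROM   sys.all_columns c
WHERE  c.collation_name IS NOT NULL; -- no 'queue%' or collation_name exclusions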