SQL Server 2008R2 – Cloud Migration and Collation Changes

cloudcollationmigrationsql serversql-server-2008-r2

Recently felt the fall out (crappy legacy code base) of a database migration by a third party vendor where the collation was swapped from case-insensitive to case-sensitive. I am wondering if the collation was deliberately changed during the migration? I would have thought that in any database migration the collation would remain unchanged (i.e. stay the same as the source database collation). Does the server which the database is hosted automatically override the collation maybe?

Best Answer

There is a lot of information missing from the question, such as:

What does "cloud" mean, exactly?
- Microsoft Azure SQL Database
- Microsoft Azure SQL Database Managed Instance
- Microsoft Azure VM with SQL Server installed
- Amazon (AWS) RDS
- Amazon (AWS) VM with SQL Server installed?
- something else
What is meant by "migration"?
- Database was detached / attached, or backed-up / restored
- Objects and data were moved via scripting out and applying at the destination, or directly ETLed
- Data Migration Assistant
- SSDT and BACPAC file
- Azure Database Migration Service
What is meant by "database"?
- An actual Database that resides in an Instance of SQL Server
- An Instance of SQL Server that contains Databases
How does "crappy legacy code base" factor into this Collation issue? Or is it just venting / kvetching?

Generally speaking, there is no obvious / default answer since there is no single mechanism / approach for migrating data, nor is there a singular definition of "cloud" in this context. I would expect that some options (such as restoring from a .bak backup file) would keep the Database's default Collation intact, but that does not imply anything about the Instance-level Collation.

What this comes down to is: even without having more detailed information to go on, since there is a Collation difference at some level (Instance and/or Database), it would seem that the vendor that performed the migration was not careful to ensure that either a) the exact same Collation was being used if given a choice to set a new Collation, or b) that you were notified that there would be a change of Collation at the specific level(s) if there was no choice and it was going to be different than the current Collation(s).

I would definitely contact the vendor regarding this issue. Typically, if there is a default Collation that cannot be changed, it is only the Instance-level, and is case-insensitive:

Azure SQL Database Managed Instances use: SQL_Latin1_General_CP1_CI_AS (unfortunately)
Partially-Contained Databases use: Latin1_General_100_CI_AS_WS_KS_SC
SQL Server Express Local DB uses: SQL_Latin1_General_CP1_CI_AS (unfortunately)

References:

Related Solutions

Database Collation – Case-Sensitive and Accent-Insensitive Options

TL;DR

There is no such thing as a "vendor-agnostic" view of Collations, nor even "version-agnostic", since their implementations -- including which aspects can be made insensitive and their naming conventions -- are vendor-specific and change over time.

Here is a summary of what I have found, and the details are in the longer section below the line:

RDBMS        Naming-             Combinations    Case-Sensitive and
             convention          of options?     Accent-Insensitive support?
-------      ------------        -------------   -----
SQL Server   _CS, _AI, etc       Yes             Latin1_General_100_CS_AI

DB2          _E{x}, _S{y}, etc   Yes             CLDR181_EO_S1

PostgreSQL   locale: en_US       N/A             unaccent(), not via Collation

MySQL        _cs, maybe _ai      No              No: _cs implies _as & _ci implies _ai
                                                 Yes? Create your own Collation :-)

Oracle       only _CI & _AI      No              No: _AI always implies _CI

SAP ASE      arbitrary: turdict  N/A             No: "AI" always implies "CI"

Informix     locale.codepage     N/A             No: no "AI" via Collations

As you can see in the chart, two of the seven RDBMSs do natively support "Case-sensitive and Accent-insensitive" operations via Collations, though they have different naming conventions (and several other functional differences).

One RDBMS -- PostgreSQL -- doesn't natively support this combination, but you can still achieve it by stripping off the accents with the unaccent() add-on function.

The last four RDBMSs, two of which have a similar naming convention for the options, neither natively support this combination nor does there appear to be a means of accomplishing this without writing your own function to remove the accents / diacritical marks. MySQL allows for creating your own Collations, but that requires that you then add it to source control and incorporate it into your testing & deployment process so that it can be applied to all servers in all environments (but still a very cool and flexible option). SAP ASE mentions that SAP can supply additional Unicode sort orders, but no mention of what they might be willing to supply.

With regards to:

Is there a good reason for this or is mine merely a rare use case?

I can say that in doing the research for this answer I came across a lot of instances of people wanting case-insensitive and accent-sensitive for MySQL, but few, if any, asking for your desired combination.

I wanted a search condition to use a case-sensitive but accent-insensitive collation but couldn't find one.
...
this question is vendor/version agnostic

You were unsuccessful in your search because it doesn't really make sense to look for an RDBMS based on a Collation specification. That's just not how Collations work. And while you are wanting to approach this as vendor-agnostic, reality is that Collations -- at least the part that we interact with -- are very much vendor-specific, and don't always fit into the scheme that you were searching on.

String comparison and sorting is highly complex, and there are different ways of performing these rules. One method is to have mappings that take into account one or more rules. Hence the four combinations of Sensitive and Insensitive for Case and Accents would equate to four separate mappings. For example, you saw this on that MSDN page for SQL Server Collation Name. If you scroll down, you will see that the left column of the chart is the Sort Order ID. Each Collation has a different ID: SQL_Latin1_General_Cp1_CI_AS = 52 while SQL_Latin1_General_Cp1_CS_AS = 51, even though the only difference is in the case-sensitivity.

Or, it can be rule-based, such as what Unicode offers through the Unicode Collation Algorithm (UCA). In this approach, every character is given, by default, one or more weights. Then, each culture / locale has the option to override any of those weights, or remove rules, or add rules. The algorithm takes into account any locale-specific rules, and then potentially manipulates those weights based on any options chosen (sensitivity, which case comes first when doing case-sensitive sorts, etc). This is one reason why doing Unicode sorting is a bit slower than non-Unicode sorting.

To get a sense of how many options there really are (i.e. the actual complexity), check out this demo from the ICU (International Components for Unicode) project:

ICU Collation Demo

There are 8 separate options to specify, and some of them get represented in multiple elements of the Collation name specification that you are thinking of (e.g. CS, CI, AS, AI, etc). Given how many variations there are, using the mapping file approach where each combination has its own ID would result in many thousands of files. Many of those files would need to get updated whenever there are changes in those particular languages, or when bugs are found. This is probably why there are only 75 of those type of Collations in SQL Server 2012 (i.e. those with names starting with SQL_). Hence no combination for _CS_AI.

And the reason why you couldn't find that combination for the UCA-based Collations? Well, there are 3810 Collations in SQL Server 2012 that do not begin with SQL_, so 3885 Collations total. That list seems to be too long to be enumerated fully on a web page. But this doesn't fully explain why you couldn't find this combination for other vendors.

Beyond what has already been mentioned (i.e. too many combinations to implement, and too many implementations to list), you still need to contend with vendor-specific implementations. Meaning: not all vendors allow for tailoring all of those options, and there is no standard naming convention for Collations in the first place. Also, not all vendors view the sorting options as being a part of the Collation: PostgreSQL Collations are default ordering for the chosen locale, and you need to use ILIKE to get a case-insensitive comparison. See below for vendor-specific info.

SQL Server (Microsoft)

The distinction between what you are seeing on those two MSDN documentation pages and the query provided by @MartinSmith in a comment on the question (slightly revised below):

SELECT *
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE '%[_]CS[_]AI%';

is that those two MSDN pages are referring specifically to the very deprecated SQL Server Collations, whereas the collations that show up as a result of that query (888 of them as of SQL Server 2012, SP3) are Windows Collations.

Starting in SQL Server 2000, the older SQL Server Collations (created prior to SQL Server being able to tap into the Windows Collations) are deprecated and are not being updated with new rules or functionality. For example, starting in SQL Server 2012, a set of Collations was added that support the proper handling of the built-in functions for Supplementary Characters (i.e. the remaining UTF-16 characters beyond the "base" 65,536 characters initially defined in UCS-2). These newer Collations end in _SC (as in Supplementary Characters).

It is best to not use the SQL Server Collations -- those with names starting with SQL_. Hence you do have access to plenty of Collations that support the combination of options that you are looking for (i.e. Case-Sensitive and Accent-Insensitive). Whenever available, it is also best to use the one end in _SC as long as it has all of the other options you are wanting.

While SQL Server does use the _CS_AI naming convention, there is no list of all 3810 (as of SQL Server 2012) Windows Collations. There is just the Windows Collation Name page that lists all of the locales and versions, and how the naming convention works, but that's it.

SQL Server also supports toggling both Width and Kana sensitivity.

MySQL (bought by Oracle)

The MySQL version 5.7, documentation states that it does support the _ai, _as, _ci, and _cs suffixes (and _bin for completeness), but also states:

For nonbinary collation names that do not specify accent sensitivity, it is determined by case sensitivity. That is, if a collation name does not contain _ai or _as, _ci in the name implies _ai and _cs in the name implies _as.

For example, latin1_general_ci is case insensitive (and accent insensitive, implicitly), latin1_general_cs is case sensitive (and accent sensitive, implicitly)

This certainly implies that it is possible to have a latin1_general_cs_ai Collation. However, the MySQL 5.5.50 server that I have access to does not have any collations with more than one suffix, and the only suffixes I see are: _cs, _ci, and _bin across 198 total Collations. I used the SHOW COLLATION command to list them.

So, while it sounds like MySQL uses a similar naming convention (at least as far as those two options go), I cannot find a Collation matching what you are looking for. However, it might be possible to strip off the accents (and other diacritical marks) and use a _cs collation to get what you want (similar to how you would do it in PostgreSQL -- see below). But I am not sure of this option and do not have time at the moment to research further.

OR, you might could create your very own Collation to do exactly what you want. Unlike the other RDBMSs, MySQL appears to make it rather simple to add your own Collations, in which case you are in full control over the weighting of each character. Please see Adding a Simple Collation to an 8-Bit Character Set and Adding a UCA Collation to a Unicode Character Set for more details.

For more info on how MySQL handles different types of Collations, please see their Collation Implementation Types page.

PostgreSQL

Collations in PostgreSQL seem to be far less flexible. You specify only the culture / locale: en_US, de_DE, etc. Please see their documentation page for Collation Support for details. Hence, by default you get the culture-specific overrides, but the Collations are otherwise everything-sensitive (which, by the way, is not the same as a "binary" collation).

You can use ILIKE (section 9.7.1) to get case insensitivity, but they do not have a similar operator for accent sensitivity. However, I found that they do have an unaccent function that can be used to strip off accents and other diacritical marks. Please note that this function is an Additional Supplied Module and hence is not necessarily present in any particular PostgreSQL server to use. That most recently linked documentation states:

When building from the source distribution, these components are not built automatically, unless you build the "world" target
...
If you are using a pre-packaged version of PostgreSQL, these modules are typically made available as a separate subpackage, such as postgresql-contrib.

Please see that documentation for instructions on how to get that function if you don't have it and want it.

More information can also be found in the following Stack Overflow answer:

Does PostgreSQL support “accent insensitive” collations?

DB2 (IBM)

Similar to Microsoft SQL Server, DB2 has two types of Collations:

"SYSTEM" Collations, which are specified using the following format: SYSTEM_{codepage}_[optional-territory]. These are not very flexible, and do not appear to support tailoring sensitivity to case, accents, or anything. You can find the list of supported Collations here: Supported territory codes and code pages
Unicode Collation Algorithm (UCA)-based Collations. These do support quite a bit of tailoring. Please see their Unicode Collation Algorithm based collations page for details about how to configure the behavior, the naming convention, and the list of valid locales. Please note that in Table 1, the example in the third row ("Case Level") starts with:

Setting the Case Level attribute to on and the Strength attribute to primary level will ignore accent but not case.

That is exactly what you were looking for. But, the syntax for that is: CLDR181_EO_S1. And this is why your searching did not find anything related to DB2.

Oracle

Oracle 10g added support for doing accent insensitive comparisons and sorting. However:

they only have the options to denote "insensitive" operations: _CI and _AI
you can only specify one of those options at a time
the case-insensitive option -- _CI -- is still accent sensitive
the accent-insensitive option -- _AI -- "is always case-insensitive as well." (quoted from their documentation which is linked below)

Please see their Linguistic Sorting and String Searching documentation page for more details and examples.

SAP ASE (formerly Sybase ASE, a.k.a. Sybase)

ASE supports one or more of the following combinations of sensitivities per each locale / character set:

case-sensitive, accent-sensitive
case-insensitive, accent-sensitive
case-insensitive, accent-sensitive, order with preference
case-insensitive, accent-insensitive

You can see the relationship between locale, character set, and available sort orders on their Selecting the Default Sort Order page. And you can see the full list of Collations on their Collation Names and IDs page.

Their Collation naming convention is arbitrary in that they are all 4 - 8 characters and try to capture the locale name or code page and and some sense of the sorting. For example:

altnoacc == "CP 850 Alternative – no accent"
rusdict == "Russian dictionary ordering"
dynix == "Chinese phonetic ordering"

There is a note on their Selecting the Default Unicode Sort Order page stating:

You can add sort orders using external files in the $/collate/Unicode directory. The names and collation IDs are stored in syscharsets. The names of external Unicode sort orders do not have to be in syscharsets before you can set the default Unicode sort order.
...
External Unicode sort orders are provided by SAP. Do not attempt to create external Unicode sort orders.

It is unclear whether or not SAP would supply an external sort order to allow for Case-Sensitive and Accent-Insensitive. Maybe someday I will email them and ask if one could be requested.

In order to get the desired combination of sensitivities, you should be able to create a scalar user-defined function to strip off accents and other diacritical marks.

Informix (bought by IBM)

Informix appears to mostly just support the default sorting and comparison behavior of a Collation. Hence Collations are just the locale and character set. Case-sensitivity is handled at the database-level, and by default they are case-sensitive. You can set a database (not a table, or a column, or a query, or even a predicate) to be case-insensitive by specifying NLSCASE INSENSITIVE in the CREATE DATABASE statement.

While the database Collation -- locale and character set -- can be overridden per client connection, there does not seem to be a way to override the case-sensitivity setting. AND, the NLSCASE option has "NLS" in the name for a reason: it only affects NCHAR and NVARCHAR data; CHAR and VARCHAR are always case-sensitive.

Accent-sensitivity is not addressed, nor is there a built-in function to strip of the accents / diacritical marks.

The Informix Collation naming convention is:

<lang>_<country>.<code set>

where:

<lang> = a 2-letter or 3-letter language code
<country> = a 2-letter country or region code
<code set> = the code page specified in one of the 3 following equivalent ways:
- name: 8859-1
- decimal value of IBM CCSID number: 819
- hexadecimal value of IBM CCSID number: 0333

Hence, the following three locale specifications all refer to the exact same locale:

fr_fr.8859-1
fr_fr.819
fr_fr.0333

For more information, please see:

SQL Server 2014 – Can Database Objects Be Case Insensitive While Keeping Strings Case Sensitive?

Mostly "yes", depending on how "automagic" you want everything to work.

First: Database identifier Collation (i.e. object names, column names, index names, etc) is taken from the Database's default Collation. That will determine the Collation of columns such as sys.objects.name, etc. So to do this, you can create or alter the Database to have a case-Insensitive Collation, such as Latin1_General_100_CI_AS_SC. This will allow you to have a table TEST that will be valid in queries such as select * from test;.

Server-scoped identifier Collation (e.g. Database names) is handled by the Instance's default Collation.

Second: If a Database's default Collation is case-Insensitive, then expressions that are entirely string literals and/or variables and/or output-parameters and/or UDF return values will be treated as case-Insensitive. For these expressions you would need to override that default behavior, per expression, using the COLLATE clause.

For example:

/* take database default collation behavior */
IF ('test' = 'TEST')
BEGIN
  PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
  PRINT 'Case-Sensitive';
END;
-- ? result depends on the default collation of the "current" database ?


/* force case-sensitive */
IF ('test' = 'TEST' COLLATE Latin1_General_100_CS_AS_SC)
BEGIN
  PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
  PRINT 'Case-Sensitive';
END;
-- Case-Sensitive


/* force case-insensitive */
IF ('test' = 'TEST' COLLATE Latin1_General_100_CI_AS_SC)
BEGIN
  PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
  PRINT 'Case-Sensitive';
END;
-- Case-Insensitive

Expressions between string columns and either string literals or variables will use the Collation of the string column, and if the Collation for the column is case-sensitive, then the expression will be evaluated as case-sensitive without needing to use the COLLATE clause.

By default, newly created columns that do not specify the COLLATE clause will inherit the Database's default Collation. In this case, you will need to specify something like COLLATE Latin1_General_100_CS_AS_KS_WS_SC for all VARCHAR / NVARCHAR columns that you want to be sensitive across all 4 categories (plus support Supplementary Characters).

PLEASE NOTE: Be very cautious about using a binary Collation (i.e. Latin1_General_100_BIN2) to get an accent/case/kana/width sensitive Collation; binary Collations are not truly case-sensitive.