Recently felt the fall out (crappy legacy code base) of a database migration by a third party vendor where the collation was swapped from case-insensitive to case-sensitive. I am wondering if the collation was deliberately changed during the migration? I would have thought that in any database migration the collation would remain unchanged (i.e. stay the same as the source database collation). Does the server which the database is hosted automatically override the collation maybe?
SQL Server 2008R2 – Cloud Migration and Collation Changes
cloudcollationmigrationsql serversql-server-2008-r2
Related Solutions
TL;DR
There is no such thing as a "vendor-agnostic" view of Collations, nor even "version-agnostic", since their implementations -- including which aspects can be made insensitive and their naming conventions -- are vendor-specific and change over time.
Here is a summary of what I have found, and the details are in the longer section below the line:
RDBMS Naming- Combinations Case-Sensitive and
convention of options? Accent-Insensitive support?
------- ------------ ------------- -----
SQL Server _CS, _AI, etc Yes Latin1_General_100_CS_AI
DB2 _E{x}, _S{y}, etc Yes CLDR181_EO_S1
PostgreSQL locale: en_US N/A unaccent(), not via Collation
MySQL _cs, maybe _ai No No: _cs implies _as & _ci implies _ai
Yes? Create your own Collation :-)
Oracle only _CI & _AI No No: _AI always implies _CI
SAP ASE arbitrary: turdict N/A No: "AI" always implies "CI"
Informix locale.codepage N/A No: no "AI" via Collations
As you can see in the chart, two of the seven RDBMSs do natively support "Case-sensitive and Accent-insensitive" operations via Collations, though they have different naming conventions (and several other functional differences).
One RDBMS -- PostgreSQL -- doesn't natively support this combination, but you can still achieve it by stripping off the accents with the unaccent()
add-on function.
The last four RDBMSs, two of which have a similar naming convention for the options, neither natively support this combination nor does there appear to be a means of accomplishing this without writing your own function to remove the accents / diacritical marks. MySQL allows for creating your own Collations, but that requires that you then add it to source control and incorporate it into your testing & deployment process so that it can be applied to all servers in all environments (but still a very cool and flexible option). SAP ASE mentions that SAP can supply additional Unicode sort orders, but no mention of what they might be willing to supply.
With regards to:
Is there a good reason for this or is mine merely a rare use case?
I can say that in doing the research for this answer I came across a lot of instances of people wanting case-insensitive and accent-sensitive for MySQL, but few, if any, asking for your desired combination.
I wanted a search condition to use a case-sensitive but accent-insensitive collation but couldn't find one.
...
this question is vendor/version agnostic
You were unsuccessful in your search because it doesn't really make sense to look for an RDBMS based on a Collation specification. That's just not how Collations work. And while you are wanting to approach this as vendor-agnostic, reality is that Collations -- at least the part that we interact with -- are very much vendor-specific, and don't always fit into the scheme that you were searching on.
String comparison and sorting is highly complex, and there are different ways of performing these rules. One method is to have mappings that take into account one or more rules. Hence the four combinations of Sensitive and Insensitive for Case and Accents would equate to four separate mappings. For example, you saw this on that MSDN page for SQL Server Collation Name. If you scroll down, you will see that the left column of the chart is the Sort Order ID
. Each Collation has a different ID: SQL_Latin1_General_Cp1_CI_AS
= 52 while SQL_Latin1_General_Cp1_CS_AS
= 51, even though the only difference is in the case-sensitivity.
Or, it can be rule-based, such as what Unicode offers through the Unicode Collation Algorithm (UCA). In this approach, every character is given, by default, one or more weights. Then, each culture / locale has the option to override any of those weights, or remove rules, or add rules. The algorithm takes into account any locale-specific rules, and then potentially manipulates those weights based on any options chosen (sensitivity, which case comes first when doing case-sensitive sorts, etc). This is one reason why doing Unicode sorting is a bit slower than non-Unicode sorting.
To get a sense of how many options there really are (i.e. the actual complexity), check out this demo from the ICU (International Components for Unicode) project:
There are 8 separate options to specify, and some of them get represented in multiple elements of the Collation name specification that you are thinking of (e.g. CS
, CI
, AS
, AI
, etc). Given how many variations there are, using the mapping file approach where each combination has its own ID would result in many thousands of files. Many of those files would need to get updated whenever there are changes in those particular languages, or when bugs are found. This is probably why there are only 75 of those type of Collations in SQL Server 2012 (i.e. those with names starting with SQL_
). Hence no combination for _CS_AI
.
And the reason why you couldn't find that combination for the UCA-based Collations? Well, there are 3810 Collations in SQL Server 2012 that do not begin with SQL_
, so 3885 Collations total. That list seems to be too long to be enumerated fully on a web page. But this doesn't fully explain why you couldn't find this combination for other vendors.
Beyond what has already been mentioned (i.e. too many combinations to implement, and too many implementations to list), you still need to contend with vendor-specific implementations. Meaning: not all vendors allow for tailoring all of those options, and there is no standard naming convention for Collations in the first place. Also, not all vendors view the sorting options as being a part of the Collation: PostgreSQL Collations are default ordering for the chosen locale, and you need to use ILIKE
to get a case-insensitive comparison. See below for vendor-specific info.
SQL Server (Microsoft)
The distinction between what you are seeing on those two MSDN documentation pages and the query provided by @MartinSmith in a comment on the question (slightly revised below):
SELECT *
FROM sys.fn_helpcollations()
WHERE [name] LIKE '%[_]CS[_]AI%';
is that those two MSDN pages are referring specifically to the very deprecated SQL Server Collations, whereas the collations that show up as a result of that query (888 of them as of SQL Server 2012, SP3) are Windows Collations.
Starting in SQL Server 2000, the older SQL Server Collations (created prior to SQL Server being able to tap into the Windows Collations) are deprecated and are not being updated with new rules or functionality. For example, starting in SQL Server 2012, a set of Collations was added that support the proper handling of the built-in functions for Supplementary Characters (i.e. the remaining UTF-16 characters beyond the "base" 65,536 characters initially defined in UCS-2). These newer Collations end in _SC
(as in Supplementary Characters).
It is best to not use the SQL Server Collations -- those with names starting with SQL_
. Hence you do have access to plenty of Collations that support the combination of options that you are looking for (i.e. Case-Sensitive and Accent-Insensitive). Whenever available, it is also best to use the one end in _SC
as long as it has all of the other options you are wanting.
While SQL Server does use the _CS_AI
naming convention, there is no list of all 3810 (as of SQL Server 2012) Windows Collations. There is just the Windows Collation Name page that lists all of the locales and versions, and how the naming convention works, but that's it.
SQL Server also supports toggling both Width and Kana sensitivity.
MySQL (bought by Oracle)
The MySQL version 5.7, documentation states that it does support the _ai
, _as
, _ci
, and _cs
suffixes (and _bin
for completeness), but also states:
For nonbinary collation names that do not specify accent sensitivity, it is determined by case sensitivity. That is, if a collation name does not contain
_ai
or_as
,_ci
in the name implies_ai
and_cs
in the name implies_as
.For example,
latin1_general_ci
is case insensitive (and accent insensitive, implicitly),latin1_general_cs
is case sensitive (and accent sensitive, implicitly)
This certainly implies that it is possible to have a latin1_general_cs_ai
Collation. However, the MySQL 5.5.50 server that I have access to does not have any collations with more than one suffix, and the only suffixes I see are: _cs
, _ci
, and _bin
across 198 total Collations. I used the SHOW COLLATION command to list them.
So, while it sounds like MySQL uses a similar naming convention (at least as far as those two options go), I cannot find a Collation matching what you are looking for. However, it might be possible to strip off the accents (and other diacritical marks) and use a _cs
collation to get what you want (similar to how you would do it in PostgreSQL -- see below). But I am not sure of this option and do not have time at the moment to research further.
OR, you might could create your very own Collation to do exactly what you want. Unlike the other RDBMSs, MySQL appears to make it rather simple to add your own Collations, in which case you are in full control over the weighting of each character. Please see Adding a Simple Collation to an 8-Bit Character Set and Adding a UCA Collation to a Unicode Character Set for more details.
For more info on how MySQL handles different types of Collations, please see their Collation Implementation Types page.
PostgreSQL
Collations in PostgreSQL seem to be far less flexible. You specify only the culture / locale: en_US
, de_DE
, etc. Please see their documentation page for Collation Support for details. Hence, by default you get the culture-specific overrides, but the Collations are otherwise everything-sensitive (which, by the way, is not the same as a "binary" collation).
You can use ILIKE (section 9.7.1) to get case insensitivity, but they do not have a similar operator for accent sensitivity. However, I found that they do have an unaccent function that can be used to strip off accents and other diacritical marks. Please note that this function is an Additional Supplied Module and hence is not necessarily present in any particular PostgreSQL server to use. That most recently linked documentation states:
When building from the source distribution, these components are not built automatically, unless you build the "world" target
...
If you are using a pre-packaged version of PostgreSQL, these modules are typically made available as a separate subpackage, such as postgresql-contrib.
Please see that documentation for instructions on how to get that function if you don't have it and want it.
More information can also be found in the following Stack Overflow answer:
Does PostgreSQL support “accent insensitive” collations?
DB2 (IBM)
Similar to Microsoft SQL Server, DB2 has two types of Collations:
"SYSTEM" Collations, which are specified using the following format:
SYSTEM_{codepage}_[optional-territory]
. These are not very flexible, and do not appear to support tailoring sensitivity to case, accents, or anything. You can find the list of supported Collations here: Supported territory codes and code pagesUnicode Collation Algorithm (UCA)-based Collations. These do support quite a bit of tailoring. Please see their Unicode Collation Algorithm based collations page for details about how to configure the behavior, the naming convention, and the list of valid locales. Please note that in Table 1, the example in the third row ("Case Level") starts with:
Setting the Case Level attribute to on and the Strength attribute to primary level will ignore accent but not case.
That is exactly what you were looking for. But, the syntax for that is:
CLDR181_EO_S1
. And this is why your searching did not find anything related to DB2.
Oracle
Oracle 10g added support for doing accent insensitive comparisons and sorting. However:
- they only have the options to denote "insensitive" operations:
_CI
and_AI
- you can only specify one of those options at a time
- the case-insensitive option --
_CI
-- is still accent sensitive - the accent-insensitive option --
_AI
-- "is always case-insensitive as well." (quoted from their documentation which is linked below)
Please see their Linguistic Sorting and String Searching documentation page for more details and examples.
SAP ASE (formerly Sybase ASE, a.k.a. Sybase)
ASE supports one or more of the following combinations of sensitivities per each locale / character set:
- case-sensitive, accent-sensitive
- case-insensitive, accent-sensitive
- case-insensitive, accent-sensitive, order with preference
- case-insensitive, accent-insensitive
You can see the relationship between locale, character set, and available sort orders on their Selecting the Default Sort Order page. And you can see the full list of Collations on their Collation Names and IDs page.
Their Collation naming convention is arbitrary in that they are all 4 - 8 characters and try to capture the locale name or code page and and some sense of the sorting. For example:
altnoacc
== "CP 850 Alternative – no accent"
rusdict
== "Russian dictionary ordering"
dynix
== "Chinese phonetic ordering"
There is a note on their Selecting the Default Unicode Sort Order page stating:
You can add sort orders using external files in the
$/collate/Unicode
directory. The names and collation IDs are stored insyscharsets
. The names of external Unicode sort orders do not have to be insyscharsets
before you can set the default Unicode sort order.
...
External Unicode sort orders are provided by SAP. Do not attempt to create external Unicode sort orders.
It is unclear whether or not SAP would supply an external sort order to allow for Case-Sensitive and Accent-Insensitive. Maybe someday I will email them and ask if one could be requested.
In order to get the desired combination of sensitivities, you should be able to create a scalar user-defined function to strip off accents and other diacritical marks.
Informix (bought by IBM)
Informix appears to mostly just support the default sorting and comparison behavior of a Collation. Hence Collations are just the locale and character set. Case-sensitivity is handled at the database-level, and by default they are case-sensitive. You can set a database (not a table, or a column, or a query, or even a predicate) to be case-insensitive by specifying NLSCASE INSENSITIVE in the CREATE DATABASE
statement.
While the database Collation -- locale and character set -- can be overridden per client connection, there does not seem to be a way to override the case-sensitivity setting. AND, the NLSCASE
option has "NLS" in the name for a reason: it only affects NCHAR
and NVARCHAR
data; CHAR
and VARCHAR
are always case-sensitive.
Accent-sensitivity is not addressed, nor is there a built-in function to strip of the accents / diacritical marks.
The Informix Collation naming convention is:
<lang>_<country>.<code set>
where:
<lang>
= a 2-letter or 3-letter language code<country>
= a 2-letter country or region code<code set>
= the code page specified in one of the 3 following equivalent ways:- name: 8859-1
- decimal value of IBM CCSID number: 819
- hexadecimal value of IBM CCSID number: 0333
Hence, the following three locale specifications all refer to the exact same locale:
- fr_fr.8859-1
- fr_fr.819
- fr_fr.0333
For more information, please see:
Mostly "yes", depending on how "automagic" you want everything to work.
First: Database identifier Collation (i.e. object names, column names, index names, etc) is taken from the Database's default Collation. That will determine the Collation of columns such as sys.objects.name
, etc. So to do this, you can create or alter the Database to have a case-Insensitive Collation, such as Latin1_General_100_CI_AS_SC
. This will allow you to have a table TEST
that will be valid in queries such as select * from test;
.
Server-scoped identifier Collation (e.g. Database names) is handled by the Instance's default Collation.
Second: If a Database's default Collation is case-Insensitive, then expressions that are entirely string literals and/or variables and/or output-parameters and/or UDF return values will be treated as case-Insensitive. For these expressions you would need to override that default behavior, per expression, using the COLLATE
clause.
For example:
/* take database default collation behavior */
IF ('test' = 'TEST')
BEGIN
PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
PRINT 'Case-Sensitive';
END;
-- ? result depends on the default collation of the "current" database ?
/* force case-sensitive */
IF ('test' = 'TEST' COLLATE Latin1_General_100_CS_AS_SC)
BEGIN
PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
PRINT 'Case-Sensitive';
END;
-- Case-Sensitive
/* force case-insensitive */
IF ('test' = 'TEST' COLLATE Latin1_General_100_CI_AS_SC)
BEGIN
PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
PRINT 'Case-Sensitive';
END;
-- Case-Insensitive
Expressions between string columns and either string literals or variables will use the Collation of the string column, and if the Collation for the column is case-sensitive, then the expression will be evaluated as case-sensitive without needing to use the COLLATE
clause.
By default, newly created columns that do not specify the COLLATE
clause will inherit the Database's default Collation. In this case, you will need to specify something like COLLATE Latin1_General_100_CS_AS_KS_WS_SC
for all VARCHAR
/ NVARCHAR
columns that you want to be sensitive across all 4 categories (plus support Supplementary Characters).
PLEASE NOTE: Be very cautious about using a binary Collation (i.e. Latin1_General_100_BIN2
) to get an accent/case/kana/width sensitive Collation; binary Collations are not truly case-sensitive.
Related Question
- SQL Server LIKE – How to Perform Case-Insensitive Search in Case-Sensitive Database
- SQL Server Collation – How Case-Insensitive Collation Works
- Sql-server – Change collation on partition column (SQL Server 2017)
- PostgreSQL – Case-Insensitive Collation Comparing Case-Sensitive
- Postgresql – Case Insensitive ORDER BY clause using COLLATE
Best Answer
There is a lot of information missing from the question, such as:
Generally speaking, there is no obvious / default answer since there is no single mechanism / approach for migrating data, nor is there a singular definition of "cloud" in this context. I would expect that some options (such as restoring from a .bak backup file) would keep the Database's default Collation intact, but that does not imply anything about the Instance-level Collation.
What this comes down to is: even without having more detailed information to go on, since there is a Collation difference at some level (Instance and/or Database), it would seem that the vendor that performed the migration was not careful to ensure that either a) the exact same Collation was being used if given a choice to set a new Collation, or b) that you were notified that there would be a change of Collation at the specific level(s) if there was no choice and it was going to be different than the current Collation(s).
I would definitely contact the vendor regarding this issue. Typically, if there is a default Collation that cannot be changed, it is only the Instance-level, and is case-insensitive:
SQL_Latin1_General_CP1_CI_AS
(unfortunately)Latin1_General_100_CI_AS_WS_KS_SC
SQL_Latin1_General_CP1_CI_AS
(unfortunately)References: