This may be too late for the original poster, but for completeness, the way to achieve case insensitive behaviour from PostgreSQL is to set a non-deterministic collation. This is only for Postgres 12.
Details are described in docs here. Reproducing relevant portion for completeness:
A collation is either deterministic or nondeterministic. A
deterministic collation uses deterministic comparisons, which means
that it considers strings to be equal only if they consist of the same
byte sequence. Nondeterministic comparison may determine strings to be
equal even if they consist of different bytes. Typical situations
include case-insensitive comparison, accent-insensitive comparison, as
well as comparison of strings in different Unicode normal forms. It is
up to the collation provider to actually implement such insensitive
comparisons; the deterministic flag only determines whether ties are
to be broken using bytewise comparison. See also Unicode Technical
Standard 10 for more information on the terminology.
To create a nondeterministic collation, specify the property
deterministic = false to CREATE COLLATION, for example:
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
This example would use the standard Unicode collation in a
nondeterministic way. In particular, this would allow strings in
different normal forms to be compared correctly. More interesting
examples make use of the ICU customization facilities explained above.
For example:
CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
All standard and predefined collations are deterministic, all
user-defined collations are deterministic by default. While
nondeterministic collations give a more “correct” behavior, especially
when considering the full power of Unicode and its many special cases,
they also have some drawbacks. Foremost, their use leads to a
performance penalty. Also, certain operations are not possible with
nondeterministic collations, such as pattern matching operations.
Therefore, they should be used only in cases where they are
specifically wanted.
Mostly "yes", depending on how "automagic" you want everything to work.
First: Database identifier Collation (i.e. object names, column names, index names, etc) is taken from the Database's default Collation. That will determine the Collation of columns such as sys.objects.name
, etc. So to do this, you can create or alter the Database to have a case-Insensitive Collation, such as Latin1_General_100_CI_AS_SC
. This will allow you to have a table TEST
that will be valid in queries such as select * from test;
.
Server-scoped identifier Collation (e.g. Database names) is handled by the Instance's default Collation.
Second: If a Database's default Collation is case-Insensitive, then expressions that are entirely string literals and/or variables and/or output-parameters and/or UDF return values will be treated as case-Insensitive. For these expressions you would need to override that default behavior, per expression, using the COLLATE
clause.
For example:
/* take database default collation behavior */
IF ('test' = 'TEST')
BEGIN
PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
PRINT 'Case-Sensitive';
END;
-- ? result depends on the default collation of the "current" database ?
/* force case-sensitive */
IF ('test' = 'TEST' COLLATE Latin1_General_100_CS_AS_SC)
BEGIN
PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
PRINT 'Case-Sensitive';
END;
-- Case-Sensitive
/* force case-insensitive */
IF ('test' = 'TEST' COLLATE Latin1_General_100_CI_AS_SC)
BEGIN
PRINT 'Case-Insensitive';
END;
ELSE
BEGIN
PRINT 'Case-Sensitive';
END;
-- Case-Insensitive
Expressions between string columns and either string literals or variables will use the Collation of the string column, and if the Collation for the column is case-sensitive, then the expression will be evaluated as case-sensitive without needing to use the COLLATE
clause.
By default, newly created columns that do not specify the COLLATE
clause will inherit the Database's default Collation. In this case, you will need to specify something like COLLATE Latin1_General_100_CS_AS_KS_WS_SC
for all VARCHAR
/ NVARCHAR
columns that you want to be sensitive across all 4 categories (plus support Supplementary Characters).
PLEASE NOTE: Be very cautious about using a binary Collation (i.e. Latin1_General_100_BIN2
) to get an accent/case/kana/width sensitive Collation; binary Collations are not truly case-sensitive.
Best Answer
TL;DR
There is no such thing as a "vendor-agnostic" view of Collations, nor even "version-agnostic", since their implementations -- including which aspects can be made insensitive and their naming conventions -- are vendor-specific and change over time.
Here is a summary of what I have found, and the details are in the longer section below the line:
As you can see in the chart, two of the seven RDBMSs do natively support "Case-sensitive and Accent-insensitive" operations via Collations, though they have different naming conventions (and several other functional differences).
One RDBMS -- PostgreSQL -- doesn't natively support this combination, but you can still achieve it by stripping off the accents with the
unaccent()
add-on function.The last four RDBMSs, two of which have a similar naming convention for the options, neither natively support this combination nor does there appear to be a means of accomplishing this without writing your own function to remove the accents / diacritical marks. MySQL allows for creating your own Collations, but that requires that you then add it to source control and incorporate it into your testing & deployment process so that it can be applied to all servers in all environments (but still a very cool and flexible option). SAP ASE mentions that SAP can supply additional Unicode sort orders, but no mention of what they might be willing to supply.
With regards to:
I can say that in doing the research for this answer I came across a lot of instances of people wanting case-insensitive and accent-sensitive for MySQL, but few, if any, asking for your desired combination.
You were unsuccessful in your search because it doesn't really make sense to look for an RDBMS based on a Collation specification. That's just not how Collations work. And while you are wanting to approach this as vendor-agnostic, reality is that Collations -- at least the part that we interact with -- are very much vendor-specific, and don't always fit into the scheme that you were searching on.
String comparison and sorting is highly complex, and there are different ways of performing these rules. One method is to have mappings that take into account one or more rules. Hence the four combinations of Sensitive and Insensitive for Case and Accents would equate to four separate mappings. For example, you saw this on that MSDN page for SQL Server Collation Name. If you scroll down, you will see that the left column of the chart is the
Sort Order ID
. Each Collation has a different ID:SQL_Latin1_General_Cp1_CI_AS
= 52 whileSQL_Latin1_General_Cp1_CS_AS
= 51, even though the only difference is in the case-sensitivity.Or, it can be rule-based, such as what Unicode offers through the Unicode Collation Algorithm (UCA). In this approach, every character is given, by default, one or more weights. Then, each culture / locale has the option to override any of those weights, or remove rules, or add rules. The algorithm takes into account any locale-specific rules, and then potentially manipulates those weights based on any options chosen (sensitivity, which case comes first when doing case-sensitive sorts, etc). This is one reason why doing Unicode sorting is a bit slower than non-Unicode sorting.
To get a sense of how many options there really are (i.e. the actual complexity), check out this demo from the ICU (International Components for Unicode) project:
ICU Collation Demo
There are 8 separate options to specify, and some of them get represented in multiple elements of the Collation name specification that you are thinking of (e.g.
CS
,CI
,AS
,AI
, etc). Given how many variations there are, using the mapping file approach where each combination has its own ID would result in many thousands of files. Many of those files would need to get updated whenever there are changes in those particular languages, or when bugs are found. This is probably why there are only 75 of those type of Collations in SQL Server 2012 (i.e. those with names starting withSQL_
). Hence no combination for_CS_AI
.And the reason why you couldn't find that combination for the UCA-based Collations? Well, there are 3810 Collations in SQL Server 2012 that do not begin with
SQL_
, so 3885 Collations total. That list seems to be too long to be enumerated fully on a web page. But this doesn't fully explain why you couldn't find this combination for other vendors.Beyond what has already been mentioned (i.e. too many combinations to implement, and too many implementations to list), you still need to contend with vendor-specific implementations. Meaning: not all vendors allow for tailoring all of those options, and there is no standard naming convention for Collations in the first place. Also, not all vendors view the sorting options as being a part of the Collation: PostgreSQL Collations are default ordering for the chosen locale, and you need to use
ILIKE
to get a case-insensitive comparison. See below for vendor-specific info.SQL Server (Microsoft)
The distinction between what you are seeing on those two MSDN documentation pages and the query provided by @MartinSmith in a comment on the question (slightly revised below):
is that those two MSDN pages are referring specifically to the very deprecated SQL Server Collations, whereas the collations that show up as a result of that query (888 of them as of SQL Server 2012, SP3) are Windows Collations.
Starting in SQL Server 2000, the older SQL Server Collations (created prior to SQL Server being able to tap into the Windows Collations) are deprecated and are not being updated with new rules or functionality. For example, starting in SQL Server 2012, a set of Collations was added that support the proper handling of the built-in functions for Supplementary Characters (i.e. the remaining UTF-16 characters beyond the "base" 65,536 characters initially defined in UCS-2). These newer Collations end in
_SC
(as in Supplementary Characters).It is best to not use the SQL Server Collations -- those with names starting with
SQL_
. Hence you do have access to plenty of Collations that support the combination of options that you are looking for (i.e. Case-Sensitive and Accent-Insensitive). Whenever available, it is also best to use the one end in_SC
as long as it has all of the other options you are wanting.While SQL Server does use the
_CS_AI
naming convention, there is no list of all 3810 (as of SQL Server 2012) Windows Collations. There is just the Windows Collation Name page that lists all of the locales and versions, and how the naming convention works, but that's it.SQL Server also supports toggling both Width and Kana sensitivity.
MySQL (bought by Oracle)
The MySQL version 5.7, documentation states that it does support the
_ai
,_as
,_ci
, and_cs
suffixes (and_bin
for completeness), but also states:This certainly implies that it is possible to have a
latin1_general_cs_ai
Collation. However, the MySQL 5.5.50 server that I have access to does not have any collations with more than one suffix, and the only suffixes I see are:_cs
,_ci
, and_bin
across 198 total Collations. I used the SHOW COLLATION command to list them.So, while it sounds like MySQL uses a similar naming convention (at least as far as those two options go), I cannot find a Collation matching what you are looking for. However, it might be possible to strip off the accents (and other diacritical marks) and use a
_cs
collation to get what you want (similar to how you would do it in PostgreSQL -- see below). But I am not sure of this option and do not have time at the moment to research further.OR, you might could create your very own Collation to do exactly what you want. Unlike the other RDBMSs, MySQL appears to make it rather simple to add your own Collations, in which case you are in full control over the weighting of each character. Please see Adding a Simple Collation to an 8-Bit Character Set and Adding a UCA Collation to a Unicode Character Set for more details.
For more info on how MySQL handles different types of Collations, please see their Collation Implementation Types page.
PostgreSQL
Collations in PostgreSQL seem to be far less flexible. You specify only the culture / locale:
en_US
,de_DE
, etc. Please see their documentation page for Collation Support for details. Hence, by default you get the culture-specific overrides, but the Collations are otherwise everything-sensitive (which, by the way, is not the same as a "binary" collation).You can use ILIKE (section 9.7.1) to get case insensitivity, but they do not have a similar operator for accent sensitivity. However, I found that they do have an unaccent function that can be used to strip off accents and other diacritical marks. Please note that this function is an Additional Supplied Module and hence is not necessarily present in any particular PostgreSQL server to use. That most recently linked documentation states:
Please see that documentation for instructions on how to get that function if you don't have it and want it.
More information can also be found in the following Stack Overflow answer:
Does PostgreSQL support “accent insensitive” collations?
DB2 (IBM)
Similar to Microsoft SQL Server, DB2 has two types of Collations:
"SYSTEM" Collations, which are specified using the following format:
SYSTEM_{codepage}_[optional-territory]
. These are not very flexible, and do not appear to support tailoring sensitivity to case, accents, or anything. You can find the list of supported Collations here: Supported territory codes and code pagesUnicode Collation Algorithm (UCA)-based Collations. These do support quite a bit of tailoring. Please see their Unicode Collation Algorithm based collations page for details about how to configure the behavior, the naming convention, and the list of valid locales. Please note that in Table 1, the example in the third row ("Case Level") starts with:
That is exactly what you were looking for. But, the syntax for that is:
CLDR181_EO_S1
. And this is why your searching did not find anything related to DB2.Oracle
Oracle 10g added support for doing accent insensitive comparisons and sorting. However:
_CI
and_AI
_CI
-- is still accent sensitive_AI
-- "is always case-insensitive as well." (quoted from their documentation which is linked below)Please see their Linguistic Sorting and String Searching documentation page for more details and examples.
SAP ASE (formerly Sybase ASE, a.k.a. Sybase)
ASE supports one or more of the following combinations of sensitivities per each locale / character set:
You can see the relationship between locale, character set, and available sort orders on their Selecting the Default Sort Order page. And you can see the full list of Collations on their Collation Names and IDs page.
Their Collation naming convention is arbitrary in that they are all 4 - 8 characters and try to capture the locale name or code page and and some sense of the sorting. For example:
altnoacc
== "CP 850 Alternative – no accent"rusdict
== "Russian dictionary ordering"dynix
== "Chinese phonetic ordering"There is a note on their Selecting the Default Unicode Sort Order page stating:
It is unclear whether or not SAP would supply an external sort order to allow for Case-Sensitive and Accent-Insensitive. Maybe someday I will email them and ask if one could be requested.
In order to get the desired combination of sensitivities, you should be able to create a scalar user-defined function to strip off accents and other diacritical marks.
Informix (bought by IBM)
Informix appears to mostly just support the default sorting and comparison behavior of a Collation. Hence Collations are just the locale and character set. Case-sensitivity is handled at the database-level, and by default they are case-sensitive. You can set a database (not a table, or a column, or a query, or even a predicate) to be case-insensitive by specifying NLSCASE INSENSITIVE in the
CREATE DATABASE
statement.While the database Collation -- locale and character set -- can be overridden per client connection, there does not seem to be a way to override the case-sensitivity setting. AND, the
NLSCASE
option has "NLS" in the name for a reason: it only affectsNCHAR
andNVARCHAR
data;CHAR
andVARCHAR
are always case-sensitive.Accent-sensitivity is not addressed, nor is there a built-in function to strip of the accents / diacritical marks.
The Informix Collation naming convention is:
where:
<lang>
= a 2-letter or 3-letter language code<country>
= a 2-letter country or region code<code set>
= the code page specified in one of the 3 following equivalent ways:Hence, the following three locale specifications all refer to the exact same locale:
For more information, please see: