SQL Server Collation – Where to Find Exact Rules

collationsql serverunicodewindows

Is there any documentation anywhere for the exact sorting/equality rules defined by a collation in Windows (SQL Server), for instance "Finnish_Swedish_CI_AS"?

I know about CI/CS, AI/AS etc, but I'm looking for a table or set of rules that actually states that e.g. 'a' is sorted before 'b' and so on. Maybe something similar to this page: http://collation-charts.org/mssql/mssql.040B.1252.Finnish_Swedish_CI_AS.html The problem with that page is that I'm unclear on some aspects of how to read it. Also it doesn't indicate that æ is regarded as equal to 'ae', and I'm wondering if there are more such cases or other rules not shown.

Also, I'm trying to find the exact difference between Finnish_Swedish_CI_AS and Finnish_Swedish_100_CI_AS

Best Answer

Unfortunately no, no such documentation exists. The most likely reason for this is that it would be much more difficult to document than most people are aware of, due to the actual complexity of the rules. While there is a default weight to each character (even if that weight is 0), how that weight is used is determined by several layers of rules. Some of those rules are dependent on the Collation (e.g. case, accent, Kana, and width sensitivity) and some are simply hard-coded (e.g. sorting lower-case before upper-case in most, but not all, cases). But those types of rules are just some of the types of rules that are processed.

Yes, there are simple mappings of æ = ae, but there are more complex mappings involving combining characters (characters that are not displayed on their own, but are super-imposed over base characters). For example:

SELECT NCHAR(252), NCHAR(0x0308), N'u' + NCHAR(0x0308)
-- ü    ̈   ü

IF (NCHAR(252) = N'u' + NCHAR(0x0308)) SELECT 'Same!' ELSE SELECT 'Nope.';
-- Same!

IF (NCHAR(252) = N'u') SELECT 'Same!' ELSE SELECT 'Nope.';
-- Nope.

IF (NCHAR(252) = N'u' COLLATE Latin1_General_100_CI_AI) SELECT 'Same!' ELSE SELECT 'Nope.';
-- Same!

Of course, some "characters" are comprised of a base character and multiple combining characters.

Also, some weights depend on context (i.e. where the character is placed in relation to other characters). For example, a hyphen (i.e. == minus sign == - ) by itself sorts before letters. A simple test shows this behavior:

SELECT * FROM (VALUES ('a'), ('c'), ('-')) t(c)
ORDER BY t.c COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT * FROM (VALUES ('a'), ('c'), ('-')) t(c)
ORDER BY t.c COLLATE Latin1_General_CI_AS ASC;

Both Collations have the same ordering here:

-
a
c

Now let's place the hyphen between some letters. First we will test with a SQL Server Collation (which uses the older, simpler "string sort"):

SELECT * FROM (VALUES ('a'), ('c'), ('-'), ('aaa'), ('aca'), ('a-b'), ('a-d')) t(c)
ORDER BY t.c COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

That returns:

-
a
a-b
a-d
aaa
aca
c

Here there is no difference in that the - still comes before the letters, which is why both a- rows came before both the aa and ac rows. Now use a Windows Collation:

SELECT * FROM (VALUES ('a'), ('c'), ('-'), ('aaa'), ('aca'), ('a-b'), ('a-d')) t(c)
ORDER BY t.c COLLATE Latin1_General_CI_AS ASC;

That returns:

-
a
aaa
a-b
aca
a-d
c

What you are seeing here is that the hyphen, due to being between letters, is ignored (and yes, it is ignored only when sorting; when doing a comparison, it is not ignored, which can be seeing by running: SELECT 1 WHERE 'a-b' = 'ab' COLLATE Latin1_General_CI_AS;). This is why a-b sorts between aaa and aca as it is seen as being ab, and a-d comes after aca as it is seen as being ad.

Two other complicating factors are:

What versions of the Unicode Character Database (UCD) and Common Locale Data Repository (CLDR) are being used. It is not as straight-forward as one (or many) would hope, and the older versions of the Unicode standard either do not seem to have all of the base info, or at least they certainly don't make it easy to find
How precise is Microsoft's implementation of the Unicode standard? After all, it is just a standard and can be implemented differently by different vendors.

All that being said, you can still probably get pretty close to what you are looking for by examining some of the base data files from Unicode.org. But first, we will need to know which locale-specific customization file to look at. We can find this by getting the LCID of the Collation in question:

SELECT COLLATIONPROPERTY('Finnish_Swedish_CI_AS', 'LCID'),
       CONVERT(VARBINARY(3), COLLATIONPROPERTY('Finnish_Swedish_CI_AS', 'LCID'));
-- 1035       0x00040B

Now we can look for 040B on the following page: LCID Structure.

We see that 0x040B refers to fi-FI, which means that we will be looking for a file having either fi-FI or just fi in its name.

Explanations (some "light" reading ;-):
Data:
- Transformations (æ → ae, etc.)
- Remapped Unicode Element Table for use with CLDR
- Root collation order
- Finnish customizations (based on LCID of 1035 / 0x040B meaning "fi")

Related Solutions

Sql-server – Comparing binary 0x and 0x00 turns out to be equal on SQL Server

I couldn't find this comparison behaviour specified anywhere in BOL.

But the Connect Item Invalid equality comparison for varbinary data with right padded zeros states that

Basically, the standard leaves it up to implementation to treat strings that differ only by [trailing] 00 as equal or less. We treat it as equal.

The Connect Item also states that the presence of trailing zeroes is the only case in which SQL Server differs from byte-by-byte comparison behavior.

In order to distinguish between two binary values in SQL Server that differ only by trailing 0x00 characters you can also add DATALENGTH into the comparison as indicated in your question.

The reason for preferring DATALENGTH rather than LEN generally here is because the latter gives an implicit cast to varchar and then you get the problem with trailing spaces.

+-------------+--------------------+
| LEN(0x2020) | DATALENGTH(0x2020) |
+-------------+--------------------+
|           0 |                  2 |
+-------------+--------------------+

Though either would work in your use case.

MySQL – How to Find Duplicate Key Violations in Different Collation

A query like this did the trick:

select date, thing_id, email_domain collate utf8_unicode_ci
from my_table
where date = '2013-07-15'
and thing_id = 119
and email_domain collate utf8_unicode_ci = 'comcast.net';

Best Answer

Related Solutions

Sql-server – Comparing binary 0x and 0x00 turns out to be equal on SQL Server

MySQL – How to Find Duplicate Key Violations in Different Collation

Related Question