SQL Server Collation – Where to Find Exact Rules

collationsql serverunicodewindows

Is there any documentation anywhere for the exact sorting/equality rules defined by a collation in Windows (SQL Server), for instance "Finnish_Swedish_CI_AS"?

I know about CI/CS, AI/AS etc, but I'm looking for a table or set of rules that actually states that e.g. 'a' is sorted before 'b' and so on. Maybe something similar to this page: http://collation-charts.org/mssql/mssql.040B.1252.Finnish_Swedish_CI_AS.html The problem with that page is that I'm unclear on some aspects of how to read it. Also it doesn't indicate that æ is regarded as equal to 'ae', and I'm wondering if there are more such cases or other rules not shown.

Also, I'm trying to find the exact difference between Finnish_Swedish_CI_AS and Finnish_Swedish_100_CI_AS

Best Answer

Unfortunately no, no such documentation exists. The most likely reason for this is that it would be much more difficult to document than most people are aware of, due to the actual complexity of the rules. While there is a default weight to each character (even if that weight is 0), how that weight is used is determined by several layers of rules. Some of those rules are dependent on the Collation (e.g. case, accent, Kana, and width sensitivity) and some are simply hard-coded (e.g. sorting lower-case before upper-case in most, but not all, cases). But those types of rules are just some of the types of rules that are processed.

Yes, there are simple mappings of æ = ae, but there are more complex mappings involving combining characters (characters that are not displayed on their own, but are super-imposed over base characters). For example:

SELECT NCHAR(252), NCHAR(0x0308), N'u' + NCHAR(0x0308)
-- ü    ̈   ü

IF (NCHAR(252) = N'u' + NCHAR(0x0308)) SELECT 'Same!' ELSE SELECT 'Nope.';
-- Same!

IF (NCHAR(252) = N'u') SELECT 'Same!' ELSE SELECT 'Nope.';
-- Nope.

IF (NCHAR(252) = N'u' COLLATE Latin1_General_100_CI_AI) SELECT 'Same!' ELSE SELECT 'Nope.';
-- Same!

Of course, some "characters" are comprised of a base character and multiple combining characters.

Also, some weights depend on context (i.e. where the character is placed in relation to other characters). For example, a hyphen (i.e. == minus sign == - ) by itself sorts before letters. A simple test shows this behavior:

SELECT * FROM (VALUES ('a'), ('c'), ('-')) t(c)
ORDER BY t.c COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT * FROM (VALUES ('a'), ('c'), ('-')) t(c)
ORDER BY t.c COLLATE Latin1_General_CI_AS ASC;

Both Collations have the same ordering here:

-
a
c

Now let's place the hyphen between some letters. First we will test with a SQL Server Collation (which uses the older, simpler "string sort"):

SELECT * FROM (VALUES ('a'), ('c'), ('-'), ('aaa'), ('aca'), ('a-b'), ('a-d')) t(c)
ORDER BY t.c COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

That returns:

-
a
a-b
a-d
aaa
aca
c

Here there is no difference in that the - still comes before the letters, which is why both a- rows came before both the aa and ac rows. Now use a Windows Collation:

SELECT * FROM (VALUES ('a'), ('c'), ('-'), ('aaa'), ('aca'), ('a-b'), ('a-d')) t(c)
ORDER BY t.c COLLATE Latin1_General_CI_AS ASC;

That returns:

-
a
aaa
a-b
aca
a-d
c

What you are seeing here is that the hyphen, due to being between letters, is ignored (and yes, it is ignored only when sorting; when doing a comparison, it is not ignored, which can be seeing by running: SELECT 1 WHERE 'a-b' = 'ab' COLLATE Latin1_General_CI_AS;). This is why a-b sorts between aaa and aca as it is seen as being ab, and a-d comes after aca as it is seen as being ad.

Two other complicating factors are:

  • What versions of the Unicode Character Database (UCD) and Common Locale Data Repository (CLDR) are being used. It is not as straight-forward as one (or many) would hope, and the older versions of the Unicode standard either do not seem to have all of the base info, or at least they certainly don't make it easy to find
  • How precise is Microsoft's implementation of the Unicode standard? After all, it is just a standard and can be implemented differently by different vendors.

All that being said, you can still probably get pretty close to what you are looking for by examining some of the base data files from Unicode.org. But first, we will need to know which locale-specific customization file to look at. We can find this by getting the LCID of the Collation in question:

SELECT COLLATIONPROPERTY('Finnish_Swedish_CI_AS', 'LCID'),
       CONVERT(VARBINARY(3), COLLATIONPROPERTY('Finnish_Swedish_CI_AS', 'LCID'));
-- 1035       0x00040B

Now we can look for 040B on the following page: LCID Structure.

We see that 0x040B refers to fi-FI, which means that we will be looking for a file having either fi-FI or just fi in its name.