Is there any documentation anywhere for the exact sorting/equality rules defined by a collation in Windows (SQL Server), for instance "Finnish_Swedish_CI_AS"?
I know about CI/CS, AI/AS etc, but I'm looking for a table or set of rules that actually states that e.g. 'a' is sorted before 'b' and so on. Maybe something similar to this page: http://collation-charts.org/mssql/mssql.040B.1252.Finnish_Swedish_CI_AS.html The problem with that page is that I'm unclear on some aspects of how to read it. Also it doesn't indicate that æ is regarded as equal to 'ae', and I'm wondering if there are more such cases or other rules not shown.
Also, I'm trying to find the exact difference between Finnish_Swedish_CI_AS and Finnish_Swedish_100_CI_AS
Best Answer
Unfortunately no, no such documentation exists. The most likely reason for this is that it would be much more difficult to document than most people are aware of, due to the actual complexity of the rules. While there is a default weight to each character (even if that weight is 0), how that weight is used is determined by several layers of rules. Some of those rules are dependent on the Collation (e.g. case, accent, Kana, and width sensitivity) and some are simply hard-coded (e.g. sorting lower-case before upper-case in most, but not all, cases). But those types of rules are just some of the types of rules that are processed.
Yes, there are simple mappings of
æ
=ae
, but there are more complex mappings involving combining characters (characters that are not displayed on their own, but are super-imposed over base characters). For example:Of course, some "characters" are comprised of a base character and multiple combining characters.
Also, some weights depend on context (i.e. where the character is placed in relation to other characters). For example, a hyphen (i.e. == minus sign ==
-
) by itself sorts before letters. A simple test shows this behavior:Both Collations have the same ordering here:
Now let's place the hyphen between some letters. First we will test with a SQL Server Collation (which uses the older, simpler "string sort"):
That returns:
Here there is no difference in that the
-
still comes before the letters, which is why botha-
rows came before both theaa
andac
rows. Now use a Windows Collation:That returns:
What you are seeing here is that the hyphen, due to being between letters, is ignored (and yes, it is ignored only when sorting; when doing a comparison, it is not ignored, which can be seeing by running:
SELECT 1 WHERE 'a-b' = 'ab' COLLATE Latin1_General_CI_AS;
). This is whya-b
sorts betweenaaa
andaca
as it is seen as beingab
, anda-d
comes afteraca
as it is seen as beingad
.Two other complicating factors are:
All that being said, you can still probably get pretty close to what you are looking for by examining some of the base data files from Unicode.org. But first, we will need to know which locale-specific customization file to look at. We can find this by getting the LCID of the Collation in question:
Now we can look for
040B
on the following page: LCID Structure.We see that
0x040B
refers tofi-FI
, which means that we will be looking for a file having eitherfi-FI
or justfi
in its name.æ → ae
, etc.)1035
/0x040B
meaning "fi")