Besides the non-C locale, what else is messing up the sort?

locale sort

I'm using Ubuntu 16.04_xfce xenial. This is about more than setting the correct locale or using "natural order" sort options.

I sorted the apt sources file. All lines start with "#", "##", or "deb". I expected to see all blank lines, all lines with "#", then with "##", finally those starting "deb". Look about 9 lines down, then 25 lines down in my output:

root@HEJ ~ $ sort /etc/apt/sources.list







## Also, please note that software in backports WILL NOT receive any review
# deb cdrom:[Xubuntu 16.04.1 LTS _Xenial Xerus_ - Release i386 (20160719)]/ xenial main multiverse restricted univer
# deb http://archive.canonical.com/ubuntu xenial partner
deb http://archive.canonical.com/ubuntu/ xenial partner
deb http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-backports main restricted universe multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security universe
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial universe
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates universe
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial-backports main restricted universe multiverse
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial main restricted universe multiverse
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial-updates main restricted universe multiverse
deb http://ppa.launchpad.net/cdemu/ppa/ubuntu xenial main
# deb http://reflection.oss.ou.edu/linuxmint/repos serena main upstream import backport
deb http://security.ubuntu.com/ubuntu/ xenial-security restricted universe multiverse main
deb http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://archive.canonical.com/ubuntu xenial partner
# deb-src http://archive.canonical.com/ubuntu/ xenial partner
# deb-src http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
# deb-src http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial-backports main restricted universe multiverse
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial main restricted universe multiverse
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial-updates main restricted universe multiverse
# deb-src http://ppa.launchpad.net/cdemu/ppa/ubuntu xenial main
# deb-src http://reflection.oss.ou.edu/linuxmint/repos serena main upstream import backport
# deb-src http://security.ubuntu.com/ubuntu xenial-security main restricted
# deb-src http://security.ubuntu.com/ubuntu/ xenial-security main restricted universe multiverse
# deb-src http://security.ubuntu.com/ubuntu/ xenial-security main restricted universe multiverse
# deb-src http://security.ubuntu.com/ubuntu xenial-security multiverse
# deb-src http://security.ubuntu.com/ubuntu xenial-security universe
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-backports main restricted universe multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial main restricted
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial universe
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates main restricted
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates universe
# deb-src http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://www.scootersoftware.com/ bcompare4 non-free
## distribution.
## extensively as that contained in the main release, although it includes
## Major bug fix updates produced after the final release of the
## multiverse WILL NOT receive any review or updates from the Ubuntu
## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu
## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu 
## N.B. software from this repository may not have been tested as
## newer versions of some applications which may provide useful features.
# newer versions of the distribution.
## or updates from the Ubuntu security team.
## 'partner' repository.
## respective vendors as a service to Ubuntu users.
## security team.
# See http://help.ubuntu.com/community/UpgradeNotes for how to upgrade to
## team.
## team, and may not be under a free licence. Please satisfy yourself as to
## team, and may not be under a free licence. Please satisfy yourself as to 
## This software is not part of Ubuntu, but is offered by Canonical and the
## Uncomment the following two lines to add software from Canonical's
## universe WILL NOT receive any review or updates from the Ubuntu security
## your rights to use the software. Also, please note that software in
## your rights to use the software. Also, please note that software in

The locale settings in effect:

root@HEJ ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Research shows that I need to override LC_COLLATE="en_US.UTF-8" with LC_COLLATE="C.UTF-8" (or better yet, LC_ALL=C) in order to get rational output. But there is one further issue here…

If this were only a question of character ordering, then all the "# " should sort together, and all the "##" sort together.

But what appears to be happening is that the "#" and "##" are removed from the sort keys, and I can't believe this is a function of collation order.

What else is mucking with my sort keys?

And while we're on the subject of collation order: where is the binary order of characters in specific locales documented, i.e. a human-readable list of every possible character arranged in collated order?

(A pox upon those who did not make the en_US locale definition upward compatible!)

Best Answer

This behaviour is purely a function of collation ordering controlled by your LC_COLLATE locale. Because you have a Unicode locale set, glibc uses the specified Unicode collation order in one of its defined variants, which attempts to be a somewhat "natural" sort.

This ordering is the UTS 10 Unicode Collation Algorithm ordering, with the Shift-Trimmed option for variable collation elements, and using (likely) the default collation element table. In effect, characters like #, but also most other punctuation and whitespace, are treated as less significant than differences between following alphanumeric characters, and only used to break ties. The entire algorithm is defined in some detail in the standard and it gets more complex still.

It is sometimes advised not to set LANG or LC_COLLATE for this reason. You can instead set LC_CTYPE (to a UTF-8 locale) and LC_MESSAGES (to your preferred message language), and keep collation at the POSIX default. That choice has flow-on effects either way.
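As a minimal sketch of that advice, the snippet below forces byte-order collation for a single sort invocation. The helper name sort_bytewise and the sample lines are made up for illustration; they are not taken from the question's file.

```shell
# Hypothetical helper: sort stdin in plain byte order, regardless of the
# caller's LANG / LC_COLLATE settings. LC_ALL overrides every other LC_*
# variable, but only for this one command.
sort_bytewise() {
    LC_ALL=C sort
}

# Made-up sample lines in the style of /etc/apt/sources.list.
printf '%s\n' \
    'deb http://example.org/ubuntu/ xenial main' \
    '## a double-hash comment' \
    '# deb-src http://example.org/ubuntu/ xenial main' \
    '# a single-hash comment' |
    sort_bytewise
```

In byte order the space (0x20) precedes "#" (0x23), so the "# " lines group together ahead of the "##" lines, and both come before "deb". To make this stick for a whole session while keeping UTF-8 character handling, one option is to export LC_COLLATE=C from a shell profile, provided LC_ALL is left unset.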

On your system, this is probably defined in /usr/share/i18n/locales/iso14651_t1_common, which is included by iso14651_t1, which is included by en_US. Other locales' orderings are defined in the nearby files, commonly based on the same default with localised changes (for example, sv_SE uses the same basis, but reorders ...zåäöø, collapses v and w, etc). This table, selected by LC_COLLATE, is what actually determines the behaviour on your system, and is derived from (a past version of) the Unicode standard. On newer or older systems, using different Unicode versions, the same strings may compare differently.
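To follow that include chain by hand, something like the sketch below may help. It assumes the glibc locale source format, in which one file pulls in another with a copy "name" line inside its LC_COLLATE section. The helper name collate_includes is hypothetical, and the path is the one mentioned above, which may differ or be missing on other systems.

```shell
# Hypothetical helper: print the "copy" directives inside the LC_COLLATE
# section of a glibc locale source file, i.e. the names of the files it
# takes its collation rules from.
collate_includes() {
    awk '
        /^LC_COLLATE/     { in_section = 1; next }
        /^END LC_COLLATE/ { in_section = 0 }
        in_section && /^copy/ { print }
    ' "$1"
}

# Path as named in the answer; it may differ or be absent elsewhere.
src=/usr/share/i18n/locales/en_US
if [ -r "$src" ]; then
    collate_includes "$src"
else
    echo "no readable locale source at $src" >&2
fi
```

Running the helper again on each file it names walks the chain one step at a time.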

Other encodings will have their own separate tables that may be entirely unrelated.

You can check the behaviour of your system against the specification by sorting a file containing strings from the comparison tables provided in UTS 10:

demark
de‐Luge
death
deluge
☠sad
de-luge
de Luge
☠happy
de‐luge
♡sad
deLuge
de luge
♡happy
de-Luge

(There are both hyphens, U+2010, and hyphen-minuses, U+002D, in those words.)

The order you should get is:

death
deluge
de luge
de-luge
de‐luge
deLuge
de Luge
de-Luge
de‐Luge
demark
☠happy
♡happy
☠sad
♡sad

(Some) expository explanation is given for that result in the report:

Shifted. The hyphen-minus and hyphen are grouped together, and their differences are less significant than the casing differences in the letter "l". This grouping results from the fact that they are ignorable, but their fourth level differences are according to the original primary order, which is more intuitive than Unicode order. The symbols ☠ and ♡ are ignored on levels 1-3.

Shift-Trimmed. Note how “deLuge” comes between the cased versions with spaces and hyphens. The symbols ☠ and ♡ are ignored on levels 1-3.

It's a bit dense. "Levels 1-3" are different levels of tiebreak weight in the algorithm, with the primary level 1 being the most important differentiator. This is probably more information than you need already, but you can at least determine that it is the specified collation order that creates the result you're seeing.
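One way to run the check described above is sketched here, using only the ASCII subset of the word list for brevity. The helper name sort_in_locale is hypothetical, and the sketch assumes en_US.UTF-8 is installed; if it is not, sort typically falls back to byte order and the two listings come out identical.

```shell
# Hypothetical helper: sort stdin under the locale named in $1.
# Usage: sort_in_locale LOCALE < file
sort_in_locale() {
    LC_ALL="$1" sort
}

# ASCII-only subset of the test words from the answer.
words=$(mktemp)
printf '%s\n' demark death deluge 'de luge' 'de-luge' \
    deLuge 'de Luge' 'de-Luge' > "$words"

echo '--- C (byte order) ---'
sort_in_locale C < "$words"

echo '--- en_US.UTF-8 (assumed to be installed) ---'
sort_in_locale en_US.UTF-8 < "$words"

rm -f "$words"
```

Comparing the two listings side by side shows where spaces and hyphens stop acting as ordinary characters and start acting only as tie-breakers.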

Related Solutions

The relevance of ‘en_AU’ in ‘LC_CTYPE’? and what is `locale LC_CTYPE` output all about

All the locale variables use the same locale name so that you can specify your favorite locale in a single swoop, e.g. LANG=en_AU.utf8. As you surmise, the country information is occasionally relevant even in LC_CTYPE, e.g. the uppercase version of i is I in most languages but İ in Turkish (tr_TR.utf8). But don't expect miracles; for example the lowercase-uppercase correspondence is one-to-one, so there's no good uppercase version of ß in de_DE.iso8859-1 (it should be SS).

You'll have an easier time understanding the output of locale -k LC_CTYPE, with -k to see the keyword names in addition to the values (without -k, the output format is designed so you can get the value of a specific keyword, e.g. locale ctype-width). The list of keywords and their meanings is system-dependent, as is the way locale data is stored, and doesn't interest many people, so you may not find much documentation outside the source code of your C library. By far the most useful form of the locale command is locale -a to list available locale names.
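A small sketch of those commands in use follows. The helper locale_listed is hypothetical. It exists because glibc's locale -a usually prints names such as en_US.utf8 where people write en_US.UTF-8, so the comparison ignores case and hyphens. The charmap keyword is assumed to be present in the LC_CTYPE output, as it is on glibc systems.

```shell
# Hypothetical helper: succeed if the locale named in $1 appears in a
# `locale -a`-style list read from stdin. Names are compared in lower
# case with hyphens removed, so en_US.UTF-8 matches en_US.utf8.
locale_listed() {
    want=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    tr 'A-Z' 'a-z' | tr -d '-' | grep -xF "$want" > /dev/null
}

if command -v locale > /dev/null 2>&1; then
    # Keyword=value form; show only the character map line if present.
    locale -k LC_CTYPE | grep '^charmap=' || true

    if locale -a | locale_listed en_US.UTF-8; then
        echo 'en_US.UTF-8 is available'
    else
        echo 'en_US.UTF-8 is not available'
    fi
fi
```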

For GNU libc (i.e. non-embedded Linux):

All locale data other than messages is stored in /usr/lib/locale/locale-archive. This file is generated by localedef from data in /usr/share/i18n and /usr/local/share/i18n. The format of the locale definition files in /usr/share/i18n/locales is only documented in the source code, I think.
The format of the character set and encoding definition files in /usr/share/i18n/charmaps is standardized by POSIX:2001. These files (or, in GNU libc, the compiled version in /usr/lib/locale/locale-archive) are used by the iconv programming and command line facility. Encoding conversions also rely on code in /usr/lib/gconv/*.so. The GNU libc manual documents how to write your own gconv module, though that section contains the text “This information should be sufficient to write new modules. Anybody doing so should also take a look at the available source code in the GNU C library sources.”.
Message catalogs get special treatment because each application comes with its own set. Message catalogs live in /usr/share/locale/*/LC_MESSAGES. The manual contains documentation for application writers. GNU libc supports both the POSIX interface catgets and the more powerful gettext interface.

Written languages are indeed very complicated, even if you don't stray far from English. Are the French and German ü the same character (is a “tréma” exactly the same as an “umlaut”, and does it matter that French and German printers typeset the accent at a slightly different height)? What is the uppercase of i (it's İ in Turkish)? Does Ö transliterate to O if you only have ASCII (in German, it's OE)? Where is Ä sorted in a dictionary (in Swedish, it's after Z)? And those are just a few examples from European languages written in the Latin alphabet! The Unicode mailing list has a lot of examples and sometimes heated discussions on such topics.

Default Order of Linux Sort Command

Looks like you are using a non-POSIX locale.

Try:

export LC_ALL=C

and then sort.

info sort clearly says:

(1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to `en_US'), then `sort' may produce output that is sorted differently than you're accustomed to. In that case, set the `LC_ALL' environment variable to `C'. Note that setting only `LC_COLLATE' has two problems. First, it is ineffective if `LC_ALL' is also set. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is set to an incompatible value. For example, you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.
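The first of those two problems can be seen in a short sketch. The function name collate_overridden is made up for illustration. LC_COLLATE is set to a non-C locale, yet LC_ALL=C takes precedence, so sort still compares bytes.

```shell
# Hypothetical demo: LC_COLLATE asks for en_US.UTF-8 collation, but
# LC_ALL is also set and wins, so the LC_COLLATE value has no effect.
collate_overridden() {
    LC_COLLATE=en_US.UTF-8 LC_ALL=C sort
}

printf '%s\n' b A a B | collate_overridden
# Byte order puts uppercase ASCII before lowercase: A, B, a, b.
```

The reverse also holds: setting only LC_COLLATE=C does nothing while LC_ALL is set to something else, which is why the manual recommends LC_ALL=C.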
