Besides the non-C locale, what else is messing up the sort?

locale sort

I'm using Ubuntu 16.04_xfce xenial. This is about more than setting the correct locale or using "natural order" sort operands.

I sorted the apt sources file. All lines start with "#", "##", or "deb". I expected to see all blank lines, all lines with "#", then with "##", finally those starting "deb". Look about 9 lines down, then 25 lines down in my output:

root@HEJ ~ $ sort /etc/apt/sources.list







## Also, please note that software in backports WILL NOT receive any review
# deb cdrom:[Xubuntu 16.04.1 LTS _Xenial Xerus_ - Release i386 (20160719)]/ xenial main multiverse restricted univer
# deb http://archive.canonical.com/ubuntu xenial partner
deb http://archive.canonical.com/ubuntu/ xenial partner
deb http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-backports main restricted universe multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-security universe
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial universe
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates main restricted
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates multiverse
deb http://mirror.csclub.uwaterloo.ca/ubuntu/ xenial-updates universe
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial-backports main restricted universe multiverse
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial main restricted universe multiverse
deb http://mirror.cs.pitt.edu/ubuntu/archive xenial-updates main restricted universe multiverse
deb http://ppa.launchpad.net/cdemu/ppa/ubuntu xenial main
# deb http://reflection.oss.ou.edu/linuxmint/repos serena main upstream import backport
deb http://security.ubuntu.com/ubuntu/ xenial-security restricted universe multiverse main
deb http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://archive.canonical.com/ubuntu xenial partner
# deb-src http://archive.canonical.com/ubuntu/ xenial partner
# deb-src http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
# deb-src http://mirror.csclub.uwaterloo.ca/debian-multimedia/ stable main
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial-backports main restricted universe multiverse
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial main restricted universe multiverse
# deb-src http://mirror.cs.pitt.edu/ubuntu/archive xenial-updates main restricted universe multiverse
# deb-src http://ppa.launchpad.net/cdemu/ppa/ubuntu xenial main
# deb-src http://reflection.oss.ou.edu/linuxmint/repos serena main upstream import backport
# deb-src http://security.ubuntu.com/ubuntu xenial-security main restricted
# deb-src http://security.ubuntu.com/ubuntu/ xenial-security main restricted universe multiverse
# deb-src http://security.ubuntu.com/ubuntu/ xenial-security main restricted universe multiverse
# deb-src http://security.ubuntu.com/ubuntu xenial-security multiverse
# deb-src http://security.ubuntu.com/ubuntu xenial-security universe
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-backports main restricted universe multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial main restricted
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial universe
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates main restricted
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates multiverse
# deb-src http://us.archive.ubuntu.com/ubuntu/ xenial-updates universe
# deb-src http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://www.4pane.co.uk/ubuntu/ xenial main
# deb-src http://www.scootersoftware.com/ bcompare4 non-free
## distribution.
## extensively as that contained in the main release, although it includes
## Major bug fix updates produced after the final release of the
## multiverse WILL NOT receive any review or updates from the Ubuntu
## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu
## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu 
## N.B. software from this repository may not have been tested as
## newer versions of some applications which may provide useful features.
# newer versions of the distribution.
## or updates from the Ubuntu security team.
## 'partner' repository.
## respective vendors as a service to Ubuntu users.
## security team.
# See http://help.ubuntu.com/community/UpgradeNotes for how to upgrade to
## team.
## team, and may not be under a free licence. Please satisfy yourself as to
## team, and may not be under a free licence. Please satisfy yourself as to 
## This software is not part of Ubuntu, but is offered by Canonical and the
## Uncomment the following two lines to add software from Canonical's
## universe WILL NOT receive any review or updates from the Ubuntu security
## your rights to use the software. Also, please note that software in
## your rights to use the software. Also, please note that software in 

The locale settings in effect:

root@HEJ ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Research shows that I need to override LC_COLLATE="en_US.UTF-8" with LC_COLLATE="C.UTF-8" (or, better yet, set LC_ALL=C) in order to get rational output. But there is one further issue here…


If this were only a question of character ordering, then all the "# " should sort together, and all the "##" sort together.
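For comparison, here is a minimal sketch of what plain byte ordering does with lines like these. The three sample lines and the example.org URLs are made up for illustration; only `LC_ALL=C sort` comes from the discussion above.

```shell
# Three sample lines in the style of sources.list (made up for illustration).
sample='deb http://example.org/ubuntu xenial main
## a double-hash comment
# deb http://example.org/ubuntu xenial universe'

# LC_ALL=C forces byte-wise comparison for this one command only.
byte_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$byte_sorted"
```

In byte order the space (0x20) sorts before "#" (0x23), so the "# " line comes first, then the "##" line, then the "deb" line.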

But what appears to be happening is that the "#" and "##" are removed from the sort keys, and I can't believe this is a function of collation order.

What else is mucking with my sort keys?

And while we're on the subject of collation order, where is the binary order of characters in specific locales documented, i.e. a human-readable list of each possible character arranged in collated order?

(A pox upon those who did not make the en_US locale definition upward compatible!)

Best Answer

This behaviour is purely a function of collation ordering controlled by your LC_COLLATE locale. Because you have a Unicode locale set, glibc uses the specified Unicode collation order in one of its defined variants, which attempts to be a somewhat "natural" sort.

This ordering is the UTS 10 Unicode Collation Algorithm ordering, with shift trimming of variable collation elements, and using (likely) the default collation element table. In effect, characters like #, but also most other punctuation and whitespace, are treated as less significant than differences between following alphanumeric characters, and only used to break ties. The entire algorithm is defined in some detail in the standard and it gets more complex still.
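To make the "only used to break ties" idea concrete, here is a rough sketch that imitates the effect with ordinary tools. It is not the Unicode Collation Algorithm and not what glibc does internally. The function name `approx_sort` is hypothetical, and it handles ASCII only: it builds a primary key from the letters and digits of each line, sorts on that key, and falls back to the full line to break ties.

```shell
# approx_sort: crude imitation of "punctuation is only a tie-breaker".
# Reads lines on stdin, writes them sorted on stdout. ASCII only.
approx_sort() {
    tab=$(printf '\t')
    # Prefix each line with a key: lowercased, letters and digits only.
    LC_ALL=C awk '{ k = tolower($0); gsub(/[^a-z0-9]/, "", k); print k "\t" $0 }' |
        # Sort by the key first, then by the original line to break ties.
        LC_ALL=C sort -t "$tab" -k1,1 -k2 |
        # Drop the key again.
        cut -f2-
}

printf '%s\n' \
    'deb http://b.example/ main' \
    '## distribution.' \
    '# deb http://a.example/ main' \
    '# deb-src http://a.example/ main' | approx_sort
```

With this input the "# deb" line lands among the "deb" lines and the "##" comment sorts under "d", which is the same pattern as in the question's output.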

It is sometimes advised not to set LANG or LC_COLLATE for this reason. You can instead set LC_CTYPE (to a UTF-8 locale) and LC_MESSAGES (to your preferred message language), and keep collation at the POSIX default. Either choice has flow-on effects.
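As a sketch of that setup, assuming en_US.UTF-8 is the locale you want for characters and messages (it is the one shown in the question), a shell profile fragment might look like this. Where you put it, and whether your login environment sets LC_ALL elsewhere, is something to check on your own system.

```shell
# Assumed profile fragment: localise only the categories you care about.
export LC_CTYPE=en_US.UTF-8     # character classification and encoding
export LC_MESSAGES=en_US.UTF-8  # language of program messages
export LC_COLLATE=C             # plain byte-order collation
unset LC_ALL                    # if set, LC_ALL overrides every LC_* above

# Quick check: with byte-order collation "# " sorts before "##" before "deb".
collated=$(printf '%s\n' 'deb x' '## y' '# z' | sort)
printf '%s\n' "$collated"
```

The precedence is LC_ALL first, then the individual LC_* variable, then LANG as the fallback, which is why LC_ALL is cleared here.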


On your system, this is probably defined in /usr/share/i18n/locales/iso14651_t1_common, which is included by iso14651_t1, which is included by en_US. Other locales' orderings are defined in the nearby files, commonly based on the same default with localised changes (for example, sv_SE uses the same basis, but reorders ...zåäöø, collapses v and w, etc). This table, selected by LC_COLLATE, is what actually determines the behaviour on your system, and is derived from (a past version of) the Unicode standard. On newer or older systems, using different Unicode versions, the same strings may compare differently.

Other encodings will have their own separate tables that may be entirely unrelated.


You can check the behaviour of your system against the specification by sorting a file containing strings from the comparison tables provided in UTS 10:

demark
de‐Luge
death
deluge
☠sad
de-luge
de Luge
☠happy
de‐luge
♡sad
deLuge
de luge
♡happy
de-Luge

(there are both hyphens and hyphen-minuses in those words)

The order you should get is:

death
deluge
de luge
de-luge
de‐luge
deLuge
de Luge
de-Luge
de‐Luge
demark
☠happy
♡happy
☠sad
♡sad

Some explanation of that result is given in the report:

  • Shifted. The hyphen-minus and hyphen are grouped together, and their
    differences are less significant than the casing differences in the
    letter "l". This grouping results from the fact that they are
    ignorable, but their fourth level differences are according to the
    original primary order, which is more intuitive than Unicode order.
    The symbols ☠ and ♡ are ignored on levels 1-3.

  • Shift-Trimmed. Note how “deLuge” comes between the cased versions with spaces and hyphens. The symbols ☠ and ♡ are ignored on levels 1-3.

It's a bit dense. "Levels 1-3" are different levels of tiebreak weight in the algorithm, with the primary level 1 being the most important differentiator. This is probably more information than you need already, but you can at least determine that it is the specified collation order that creates the result you're seeing.
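If you want to script the check described above, a small helper along these lines may do. The function name `check_order` and the file names in the usage line are hypothetical. It sorts one file under a given locale and reports whether the result is identical to a second file holding the order you expect.

```shell
# check_order LOCALE INPUT EXPECTED
# Sorts INPUT under LOCALE and prints "match" if the result is byte-for-byte
# identical to EXPECTED, otherwise "differ".
check_order() {
    if LC_ALL=$1 sort -- "$2" | cmp -s - "$3"; then
        echo "match"
    else
        echo "differ"
    fi
}

# Usage (hypothetical file names):
#   check_order en_US.UTF-8 unsorted.txt expected.txt
```

Run `locale -a` first to confirm the locale is installed. If it is not, sort may fall back to byte order and the comparison will tell you nothing about the Unicode table.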