Short version: collation doesn't really work in command line utilities.
Longer version: the underlying function to compare two strings is strcoll. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm constructs this canonical form.
Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):
$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b
As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.
$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇
〼 303030
〇 3c9b
The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U3007 or U303C, nor are they included in any range that I can find.
I'm not familiar with the rules to build the collation order, but from what I understand, the relevant phrasing in the POSIX spec is:
The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.
It looks like Glibc is instead ignoring characters that aren't specified. I don't know if there's a flaw in my understanding of the POSIX spec, if I missed something in Glibc's locale definition, or if there's a bug in the Glibc locale compiler.
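A rough way to see the practical effect is to count how many lines survive sort -u, since two strings that collate as equal count as duplicates. This is a sketch only; count_unique is a hypothetical helper name. In the C locale the comparison is bytewise, so the two characters stay distinct. On a system that behaves as described above, I would expect passing en_US.UTF-8 instead to collapse them into one line, but that depends on the Glibc version and on the locale being installed.

```shell
# count_unique LOCALE: read lines on stdin and print how many remain after
# `sort -u` under the given locale.
# (count_unique is a hypothetical helper name used only for this illustration.)
count_unique() {
    LC_ALL="$1" sort -u | wc -l | tr -d ' '
}

# Bytewise comparison: the two characters differ, so both lines are kept.
printf '%s\n' '〼' '〇' | count_unique C
```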
The problem is local $/ = undef. It puts perl in slurp mode, so <> returns the entire file as a single record. The list passed to sort then contains only one element, and sort has nothing to reorder. I expect your output was identical to your input data (I also use Ubuntu 12.04 LTS, perl version 5.14.2):
$ perl -le 'local $/ = undef;print ++$i for <>' < cat
1
$ perl -le 'print ++$i for <>' < cat
1
2
3
4
5
6
7
8
9
If you remove local $/ = undef, perl's sort will produce the same output as the shell's sort with LC_ALL=C:
$ perl -e 'print sort <>' < data
Uber
peach
péché
pêche
sin
war
wird
wär
Über
Note
Without use locale, perl ignores your current locale settings. With use locale in effect, Perl's comparison operators ("lt", "le", "cmp", "ge", and "gt") use LC_COLLATE (when LC_ALL is not set), and sort is also affected because it uses cmp by default.
You can get the current LC_COLLATE value:
$ perl -MPOSIX=setlocale,LC_COLLATE -le 'print setlocale(LC_COLLATE)'
en_US.UTF-8
Best Answer
Sorting depends on the locale; specifically, it depends on $LC_COLLATE (possibly overridden by $LC_ALL), falling back to $LANG if neither is set. The command locale will show you what values you're effectively working with. See man 3 strcoll, man 3 setlocale, etc.
LC_COLLATE=C (or POSIX or no locale at all) results in a strict byte-by-byte comparison.
LC_COLLATE=en_US.utf8 results in an alphabetical-equivalence sort, with punctuation ignored and characters within the same equivalence class treated equally.
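The precedence described above can be written as a single parameter expansion. This is a sketch rather than a replacement for the locale command; effective_collate is a hypothetical helper name, and falling back to "C" when nothing is set is an assumption about the implementation default.

```shell
# effective_collate: print the locale name that governs collation, following
# the precedence LC_ALL, then LC_COLLATE, then LANG. An empty variable is
# treated like an unset one. Falling back to "C" is an assumption.
# (effective_collate is a hypothetical helper name for this illustration.)
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

Comparing its output with the LC_COLLATE line printed by locale is a quick sanity check on a given system.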