Short version: collation doesn't really work in command line utilities.
Longer version: the underlying function to compare two strings is strcoll
. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm
constructs this canonical form.
Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):
$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b
As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8
locale. They are, however, present in a Japanese locale.
$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇
〼 303030
〇 3c9b
The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US
, which includes /usr/share/i18n/locales/iso14651_t1_common
. This file doesn't have an entry for U3007
or U303C
, nor are they included in any range that I can find.
I'm not familiar with the rules to build the collation order, but from what I understand, the relevant phrasing is
The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.
It looks like Glibc is instead ignoring characters that aren't specified. I don't know if there's a flaw of my understanding of the POSIX spec, if I missed something in Glibc's locale definition, or if there's a bug in the Glibc locale compiler.
A solution in perl
:
$ perl -anle '
push @h, [$F[-1],$_];
END {
print for map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [$_->[1],hex($_->[0])] } @h;
}
' file
4 jjk 7
5 hhf 25
2 ukr 9f
3 ezh ae
1 hdh d12
Explanation
While processing file, we create an array of array @h
, each of its element is an array reference [$F[-1],$_]
, with first element is the hex value to compare, and the second element is the whole line.
In END
block, we use Schwartzian transform:
With each element of @h
, create an anonymous array, contains the whole line ( $_->[1]
the second element of each array ref in @h
) and the hex value to compare hex($_->[0])]
Sort above array base on the hex value $a->[1] <=> $b->[1]
Get the first element of each array ref in sorted array map { $_->[0] }
then print the result.
Update
With @Joseph R's suggestion, without using Schwartzian Transform:
$ perl -anle '
push @h, [hex($F[-1]),$_];
END {
print $_->[1] for
sort { $a->[0] <=> $b->[0] } @h;
}
' file
Update 2
After reading stefan's comment, I think this can call direct
:
$ perl -e '
print sort {hex((split(/\s+/,$a))[-1]) <=> hex((split(/\s+/,$b))[-1])} <>;
' file
4 jjk 7
5 hhf 25
2 ukr 9f
3 ezh ae
1 hdh d12
Best Answer
Sorting algorithms in modern locales are quite complex.
Each character (actually collating element which could consist of a sequence of several characters like the Czech
ch
) is given a number of collating weights that decide of their sorting order.When comparing two strings, the first weight of all the characters is used first, and other weights are used later to decide ties if the two strings sorted the same with the first weights.
For instance, in many locales,
e
,é
andE
have the same primary weight (they are of the same equivalence class, they all match[=e=]
).So, when comparing for instance
echo
,été
andEnter
, in the first pass,e
,é
andE
having the same primary weight, it's the second character that will determine the order (c
beforen
beforet
).When comparing
été
,Été
,Ete
, after the first pass, they all sort the same, so we use the second pass using the secondary weight. In typical GNU locales, the second weight for latin script characters is used to prioritise accents (no accent comes first, then acute, grave, breve, circumflex...). We'll then need to use the third weight to decide betweenété
andÉté
and that will be based on case (lowercase first in most locales). There are even characters that end up sorting the same because all their weights are identical.That is used to sort text in a similar way as dictionaries do, as a human would do.
In a dictionary, you'll find that space and most punctuation characters are also ignored. For instance
de facto
sorts in betweendebut
anddevoid
. The first weight of the space character is IGNORE.On a GNU system, you'll find the core collating rules defined in the
/usr/share/i18n/locales/iso14651_t1_common
(the path may vary with the distribution). In there, you'll see:Which illustrate what we've just said. Both space and
>
have their first 3 weights set toIGNORE
. It's only for strings sorting the same for the first 3 weights that their relative order will be considered (>
before space as that<h>
would list before the unspecified collating symbol<U0020>
).In locales that define
UPPERCASE_FIRST
(like/usr/share/i18n/locales/tr_TR
), upper case letters will come first (in the 3rd pass). Same withDIACRIT_FORWARD
where some locales likede_DE
can decide to reverse the order of diacritics for the 2nd pass.>
and>>
sort the same in the 1st, 2nd and 3rd pass. In the 4th,>
sorts before>>
as the empty string sorts before everything.>>b
sorts afterb
because they sort the same in the first 3 passes, but in the fourth pass,b
is IGNORE, so>
is greater. It's less thanc
in the first pass (>
ignored, andb
beforec
)... You get the idea.Now, if you look at the
C
locale definition. It's a lot simpler. There's only one weight, and the weight is based on the codepoint value from U+0000 to U+10FFFF. SoSPC
(U+0020) sorts before>
(U+003E), which sorts beforeb
(U+0062) which sorts beforec
(U+0063). No character is ever ignored.Note that with the GNU libc at least, that order defined in the C locale definition file is ignored when it comes to comparison functions (
strcoll()
and co. as used bysort
). Regardless of the value ofLC_CTYPE
, withLC_COLLATE=C
,strcoll()
is equivalent tostrcmp()
. As in the comparison is always on the byte value, even if those bytes correspond to characters whose unicode codepoint sort the other way round (like 0xA4 U+20AC EURO SIGN vs A5 U+00A5 YEN SIGN in the ISO-8859-15 charset), soLC_ALL=C sort
andLC_COLLATE=C sort
(providedLC_ALL
is not otherwise set) will have the same effect there.