What is going on in the following code snippet? I'm not getting my expected output.
I'd think it was a bug, but it happens for 2 different programs (uniq and sort), so I suspect it is something to do with… well, I don't know what.. hence the question.
The first 3 (of 4) examples work, but the 4th fails!
I would expect the same behaviour for any and all characters, i.e. 2 lines printed out (from the 3 lines of input)… but in the 4th case, I only get 1 line (for both sort -u and uniq); the two identical lines just vanish!
I've converted the output '\n' to space for compactness of view.
I'm using uniq and sort from (GNU coreutils) 7.4 … running on Ubuntu 10.04.3 LTS desktop.
The script:
{
  locale -k LC_COLLATE
  echo
  for c1 in x 〼 ;do
    for c2 in z 〇 ;do
      echo -n "asis : "; echo -e "$c1\n$c2\n$c2" |tr '\n' ' ';echo
      echo -n "uniq : "; echo -e "$c1\n$c2\n$c2" |uniq |tr '\n' ' ';echo
      echo -n "sort -u: "; echo -e "$c1\n$c2\n$c2" |sort -u |tr '\n' ' ';echo
      echo
    done
    echo
  done
}
The output:
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2081
collate-codeset="UTF-8"
asis : x z z
uniq : x z
sort -u: x z
asis : x 〇 〇
uniq : x 〇
sort -u: 〇 x
asis : 〼 z z
uniq : 〼 z
sort -u: 〼 z
asis : 〼 〇 〇
uniq : 〼
sort -u: 〼
# In the last example (of 4) where did the '〇' go? .. U+3007 IDEOGRAPHIC NUMBER ZERO
#
Best Answer
Short version: collation doesn't really work in command line utilities.
Longer version: the underlying function to compare two strings is strcoll. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm constructs this canonical form.
Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):
As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.
The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U3007 or U303C, nor are they included in any range that I can find.
I'm not familiar with the rules to build the collation order, but from what I understand, the relevant phrasing is
It looks like Glibc is instead ignoring characters that aren't specified. I don't know if there's a flaw in my understanding of the POSIX spec, if I missed something in Glibc's locale definition, or if there's a bug in the Glibc locale compiler.
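One way to check that the locale's collation tables are what makes the two characters compare equal is to rerun the failing case with LC_ALL=C, which makes uniq and sort compare raw bytes instead of collation keys. The following is only a minimal sketch under that assumption; the variable name `input` is made up for illustration, and printf is used in place of echo -e purely for portability.

```shell
# Same three input lines as the failing (4th) case in the question.
input=$(printf '%s\n' '〼' '〇' '〇')

# With LC_ALL=C the comparison is byte-wise, so the two characters
# (UTF-8: E3 80 BC and E3 80 87) are no longer treated as equal.
printf 'uniq   : '; printf '%s\n' "$input" | LC_ALL=C uniq    | tr '\n' ' '; echo
printf 'sort -u: '; printf '%s\n' "$input" | LC_ALL=C sort -u | tr '\n' ' '; echo
```

If both commands now print two distinct lines, the vanishing line comes from the collation data of the active locale rather than from uniq or sort themselves. Note that byte order is not a meaningful linguistic order, so this is a diagnostic rather than a general fix.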