Short version: collation doesn't really work in command line utilities.
Longer version: the underlying function for comparing two strings is strcoll. Its description isn't very helpful, but conceptually it converts both strings to a canonical form, then compares the two canonical forms. The function strxfrm constructs this canonical form.
Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):
$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b
As you can see, 〼 and 〇 have the same canonical form. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.
$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇
〼 303030
〇 3c9b
The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U3007 or U303C, nor are they included in any range that I can find.
I'm not familiar with the rules to build the collation order, but from what I understand, the relevant phrasing is
The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.
It looks like Glibc is instead ignoring characters that aren't specified. I don't know if there's a flaw in my understanding of the POSIX spec, if I missed something in Glibc's locale definition, or if there's a bug in the Glibc locale compiler.
You're setting LC_COLLATE for the cat command only (which doesn't make use of it), while you need to set it for sort and uniq. Also, you may need to set LC_CTYPE to something UTF-8, otherwise it will cause confusion. I would set LC_ALL to en_US.UTF-8.
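For example, a sketch with made-up sample words (if en_US.UTF-8 isn't installed, glibc silently falls back to the C locale, which gives the same result for this input):

```shell
# Set the locale on the utilities that collate (sort, uniq), not on cat.
printf '%s\n' épée a épée |
  LC_ALL=en_US.UTF-8 sort |
  LC_ALL=en_US.UTF-8 uniq
# -> a
#    épée   (the adjacent duplicate is collapsed)
```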
uniq -u only reports unique lines. So, if those single-letter words all appear several times, it's normal that they don't show up.
On my system, épée does appear twice:
$ cat american-english british-english | sort | grep -x 'épée'
épée
épée
Maybe you meant sort | uniq or sort -u.
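A sketch of the difference, with made-up input (output shown one line per word):

```shell
# uniq -u prints only lines that are NOT repeated:
printf '%s\n' a a b c c d | sort | uniq -u   # -> b  d
# sort -u (like sort | uniq) keeps one copy of every line:
printf '%s\n' a a b c c d | sort -u          # -> a  b  c  d
```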
Best Answer
The GNU implementation of uniq, as found on Ubuntu, with -c doesn't report counts of contiguous identical lines but counts of contiguous lines that sort the same¹. Most international locales on GNU systems have the bug that many completely unrelated characters have been defined with the same sort order, most of them because their sort order is not defined at all. Most other OSes make sure all characters have different sorting orders.
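You can check this with expr; a sketch with ASCII strings, whose result doesn't depend on your locale data (the 〼/〇 comparison itself is locale-dependent, so it isn't asserted here):

```shell
# expr's '=' prints 1 when its operands compare equal, 0 otherwise
# (it also exits non-zero in the unequal case, hence the || true):
expr abc = abc           # -> 1
expr abc = abd || true   # -> 0
# On an affected GNU system, "expr 〼 = 〇" would likewise print 1.
```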
(expr's = operator, for arguments that are not numerical, returns 1 if the operands sort the same, 0 otherwise.)
That's the same with ar_SY.UTF-8 or en_GB.UTF-8. What you'd need is a locale where those characters have been given a different sorting order. If Ubuntu had locales for the Syriac language, you could expect those characters to have been given different sorting orders, but Ubuntu doesn't have such locales.
You can look at the output of locale -a for a list of supported locales. You can enable more locales by running dpkg-reconfigure locales as root. You can also define more locales manually using localedef, based on the definition files in /usr/share/i18n/locales, but you'll find no data for the Syriac language there.
Note also the following about the command you posted.
You're only setting the LC_COLLATE variable for the cat command (which doesn't affect the way it outputs the content of the file; cat doesn't care about collation, nor even character encoding, as it's not a text utility). You'd want to set it for both sort and uniq. You'd also want to set LC_CTYPE to a locale that has a UTF-8 charset.
As your system doesn't have a syr_SY.utf8 locale, that's the same as using the C locale (the default locale). Actually, here the C locale or C.UTF-8 is probably the locale you'd want to use. In those locales, the collation order is based on code point: Unicode code point for C.UTF-8, byte value for C. That ends up being the same, as the UTF-8 character encoding has that property.
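A sketch of that collation order (pure ASCII input, so C and C.UTF-8 agree; if C.UTF-8 is not available on your system, glibc falls back to C with the same result):

```shell
# In the C locale, sort orders by byte value: uppercase before lowercase.
printf '%s\n' b a B A | LC_ALL=C sort         # -> A B a b
# C.UTF-8 orders by code point, which for UTF-8 is the same as byte order.
printf '%s\n' b a B A | LC_ALL=C.UTF-8 sort   # -> A B a b
```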
With such a setup (for instance a regional locale like ar_SY.UTF-8 for everything except LC_COLLATE, which you'd set to C.UTF-8), you'd have a LC_CTYPE with UTF-8 as the charset, a collation order based on code point, and the other settings relevant to your region: for instance, error messages in Syriac or Arabic, if GNU coreutils' sort or uniq messages had been translated into those languages (they haven't yet).
If you don't care about those other settings, it's just as easy (and also more portable) to set LC_ALL to C.UTF-8, or to plain C, as @isaac has already shown.
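A hedged sketch of such a sort | uniq -c pipeline with code-point collation (the input is generated inline; the Syriac letters ܐ, U+0710, and ܒ, U+0712, stand in for real data):

```shell
# With code-point-based collation, distinct characters are never
# counted together:
printf '%s\n' ܐ ܒ ܐ |
  LC_ALL=C.UTF-8 sort |
  LC_ALL=C.UTF-8 uniq -c
# -> 2 ܐ
#    1 ܒ   (uniq -c pads the counts with leading spaces)
```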
¹ Note that POSIX-compliant uniq implementations are not meant to compare strings using the locale's collation algorithm, but instead do a byte-to-byte equality comparison. That was further clarified in the 2018 edition of the standard (see the corresponding Austin Group bug). But GNU uniq currently does use strcoll(), even under POSIXLY_CORRECT; it also has a -i option for case-insensitive comparison which, ironically, doesn't use locale information and only works correctly on ASCII input.
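A sketch of that -i caveat (GNU uniq assumed; sample words made up):

```shell
# ASCII letters are folded:
printf '%s\n' foo FOO | uniq -i                   # -> foo
# Non-ASCII letters are not, even in a UTF-8 locale:
printf '%s\n' é É | LC_ALL=en_US.UTF-8 uniq -i    # -> é
                                                  #    É
```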