Where has the `uniq` or `sort -u` line gone, with some unicode characters

Tags: locale, sort, text-processing, unicode, uniq

What is going on in the following code snippet? I'm not getting my expected output.

I'd think it was a bug, but it happens with 2 different programs (uniq and sort), so I suspect it has to do with something else. I don't know what, hence the question.

The first 3 (of 4) examples work, but the 4th fails.

I would expect the same behaviour for any and all characters, i.e. 2 lines printed out (from the 3 lines of input). But in the 4th case I only get 1 line (for both sort -u and uniq); the two identical lines just vanish!

I've converted each '\n' in the output to a space for compactness of view.

I'm using uniq and sort from (GNU coreutils) 7.4 … running on Ubuntu 10.04.3 LTS desktop.

The script:

{
  locale -k LC_COLLATE
  echo
  for c1 in x 〼 ;do 
    for c2 in z 〇 ;do 
      echo -n "asis   : "; echo -e "$c1\n$c2\n$c2"          |tr '\n' ' ';echo
      echo -n "uniq   : "; echo -e "$c1\n$c2\n$c2" |uniq    |tr '\n' ' ';echo
      echo -n "sort -u: "; echo -e "$c1\n$c2\n$c2" |sort -u |tr '\n' ' ';echo
      echo
    done
    echo
  done
}

The output:

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2081
collate-codeset="UTF-8"

asis   : x z z 
uniq   : x z 
sort -u: x z 

asis   : x 〇 〇 
uniq   : x 〇 
sort -u: 〇 x 


asis   : 〼 z z 
uniq   : 〼 z 
sort -u: 〼 z 

asis   : 〼 〇 〇 
uniq   : 〼 
sort -u: 〼 

# In the last example (of 4), where did the '〇' (U+3007 IDEOGRAPHIC NUMBER ZERO) go?

Best Answer

Short version: collation doesn't really work in command line utilities.

Longer version: the underlying function to compare two strings is strcoll. The description isn't very helpful, but the conceptual method of operation is to convert both strings to a canonical form, and then compare the two canonical forms. The function strxfrm constructs this canonical form.

Let's observe the canonical forms of a few strings (with GNU libc, under Debian squeeze):

$ export LC_ALL=en_US.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' b a A à 〼 〇
b d010801020
a c010801020
A c010801090
à 101010102c6b
〼 101010102c6b102c6b102c6b
〇 101010102c6b102c6b102c6b

As you can see, 〼 and 〇 have the same canonical form, so strcoll reports them as equal, and sort -u and uniq treat the 〇 lines as repeats of the 〼 line. I think that's because these characters are not mentioned in the collation tables of the en_US.UTF-8 locale. They are, however, present in a Japanese locale.
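The same check can be made without Perl. The following is a minimal sketch, assuming a sort whose -u option compares lines by collation, as the sort in the question does: if two different strings collapse to one line under sort -u, the locale considers them equal. The function name collates_equal_in is made up for illustration, and the result for en_US.UTF-8 depends on the locale data installed on your system.

```shell
# collates_equal_in LOCALE A B   (hypothetical helper)
# Succeeds if A and B compare as equal under LOCALE's collation rules.
collates_equal_in() {
    # sort -u keeps one line per group of lines that compare equal,
    # so a line count of 1 means the two strings collate as equal.
    n=$(printf '%s\n' "$2" "$3" | LC_ALL="$1" sort -u | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}

# If a locale is not installed, sort typically falls back to the C locale.
for loc in C en_US.UTF-8 ja_JP.UTF-8; do
    if collates_equal_in "$loc" '〼' '〇'; then
        echo "$loc: equal"
    else
        echo "$loc: distinct"
    fi
done
```

In the C locale the comparison is bytewise, so the two characters always come out as distinct there.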

$ export LC_ALL=ja_JP.UTF-8
$ perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' 〼 〇 
〼 303030
〇 3c9b

The source code for the locale data (in Debian squeeze) is in /usr/share/i18n/locales/en_US, which includes /usr/share/i18n/locales/iso14651_t1_common. This file doesn't have an entry for U3007 or U303C, nor are they included in any range that I can find.

I'm not familiar with the rules to build the collation order, but from what I understand, the relevant wording in the POSIX specification is:

The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. (…) If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this section, the utility shall issue a warning message and place such characters at the end of the character collation order.

It looks like Glibc is instead ignoring characters that aren't specified. I don't know if there's a flaw in my understanding of the POSIX spec, if I missed something in Glibc's locale definition, or if there's a bug in the Glibc locale compiler.
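If all you need is for distinct byte sequences to be kept apart, one possible workaround (a sketch, not a fix for the locale data) is to run the comparison in the C locale, where lines are compared byte by byte. The function name dedupe_bytewise is made up for illustration.

```shell
# Hypothetical helper: sort and de-duplicate stdin using bytewise comparison.
dedupe_bytewise() {
    LC_ALL=C sort -u
}

# uniq in the C locale keeps both characters, in input order.
printf '%s\n' '〼' '〇' '〇' | LC_ALL=C uniq

# sort -u in the C locale keeps both as well, ordered by their UTF-8 bytes.
printf '%s\n' '〼' '〇' '〇' | dedupe_bytewise
```

The trade-off is that the resulting sort order follows byte values rather than any language's alphabetical order.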

Related Solutions

Shell – Get lines with maximum values in the column using awk, uniq and sort

I think you want

cat myfile.txt| sort -k1 -r | sort --unique --stable -k2,3

(see my comment regarding cat above). The first sort will put the newest dates to the top. The second sort will sort by user+access, but, by giving --stable, will keep the previous order of lines that have the same user+access combination, i.e. newest still on top. Giving --unique, only the first line of a run with equal user+access combination is shown. (You can replace it with | uniq -f1, I'd think, if it happens to be a GNU extension your sort doesn't have.)
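To make the two-pass idea concrete, here is a small sketch on made-up data. The three-column layout (date, user, access) and the function name latest_per_key are assumptions for illustration, since the original input file is not shown. -u and -s are the short forms of --unique and --stable, and LC_ALL=C is set only to make the ordering predictable.

```shell
# Hypothetical helper: keep the newest line for each user+access pair.
latest_per_key() {
    # Pass 1: reverse sort from field 1, so the newest dates come first.
    # Pass 2: stable sort on fields 2-3; -u keeps the first line of each
    #         run of equal keys, which is the newest one after pass 1.
    LC_ALL=C sort -k1 -r | LC_ALL=C sort -u -s -k2,3
}

printf '%s\n' \
    '2024-01-01 alice read' \
    '2024-03-01 alice read' \
    '2024-02-01 bob write' | latest_per_key
```

With GNU sort this should print the 2024-03-01 line for alice followed by the single line for bob.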

Text Processing – Why is uniq Ignoring Unicode and Single Letter Lines?

You're setting LC_COLLATE for the cat command only (which doesn't make use of it), while you need to set it for sort and uniq.

Also, you may need to set LC_CTYPE to a UTF-8 locale, otherwise it will cause confusion. I would set LC_ALL to en_US.UTF-8.
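As a rough sketch of where the assignment has to go: an assignment written in front of a command applies to that command alone, so in a pipeline it must reach sort and uniq, not cat. The file name words.txt and the function name sorted_unique_in are made up for illustration; the command from the question is not shown here.

```shell
# Affects cat only; sort and uniq still run in the inherited locale:
#   LC_ALL=en_US.UTF-8 cat words.txt | sort | uniq

# Hypothetical helper: run the whole pipeline in one locale by exporting
# LC_ALL inside a subshell, so that both sort and uniq see it.
sorted_unique_in() {
    # $1: locale name; stdin: the lines to process
    ( LC_ALL=$1; export LC_ALL; sort | uniq )
}

printf '%s\n' b a a | sorted_unique_in C
```

The subshell keeps the exported LC_ALL from leaking into the rest of the session.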

uniq -u only reports unique lines. So, if those single letter words all appear several times, it's normal that they don't show up.
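The difference is easy to see on a three-line input; this is a minimal illustration with made-up data, run in the C locale to keep collation out of the picture.

```shell
# Three input lines; the two identical ones are adjacent.
printf '%s\n' a a b | LC_ALL=C uniq       # one copy of each line: a, b
printf '%s\n' a a b | LC_ALL=C uniq -u    # only lines that never repeat: b
```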

On my system, épée does appear twice:

$ cat american-english british-english | sort | grep -x 'épée'
épée
épée

Maybe you meant sort | uniq or sort -u.
