Some UTF-8 characters not being recognized by grep or sed

grep locale sed unicode

I'm trying to list every distinct character in a file and count how often each one occurs.

The file sample consists of:

a eɪ
abandon əˈbændən
ability əˈbɪləti
able ˈeɪbəl
able ˈeɪbl
abortion əˈbɔrʃən
abortion əˈbɔrʃn
about əˈbaʊt
above əˈbʌv
abroad əˈbrɔd

I confirmed that the locale is a UTF-8 one:

$ echo $LANG

en_US.UTF-8

This pipeline takes the second field, splits it into single characters, and counts each one:

$ cat sample | awk '{print $2}' | grep -o . | sort | uniq -c | sort -n

  1 a
  1 æ
  1 i
  1 v
  2 d
  2 t
  3 e
  3 l
  3 ɔ
  3 r
  4 n
  9 b
 11 ə
 17 ɪ

Where are ʃ and ˈ? They aren't combining characters or anything special. Note that other non-ASCII characters do come through: ɔ, ə and ɪ, for example.

BTW, using sed 's/\(.\)/\1\n/g' gives nearly the same result as grep -o ., except that it adds an extra empty line for each input line.

Is there something I'm missing? Does grep have a hidden UTF-8 option?

In case it matters I'm using Ubuntu 12.04.2 LTS.

Best Answer

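The problem is that sort and uniq compare lines using the locale's collation rules, not grep or sed. In en_US.UTF-8 on this system, some characters (here ˈ, ʃ, ʊ and ʌ) apparently have no collation weight of their own, so they compare equal to ɪ and uniq folds them all into one line. That is why ɪ shows up 17 times instead of 4 (4 + 9 + 2 + 1 + 1 = 17). Forcing the C locale for those two commands makes them compare raw bytes instead, which works: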

$ cat sample | awk '{print $2}' | grep -o . | LC_ALL=C sort | LC_ALL=C uniq -c | sort -n

      1 ʊ
      1 ʌ
      1 a
      1 æ
      1 i
      1 v
      2 ʃ
      2 d
      2 t
      3 e
      3 l
      3 ɔ
      3 r
      4 ɪ
      4 n
      9 ˈ
      9 b
     11 ə
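
If this comes up often, the pipeline above can be wrapped in a small shell function. This is only a sketch: the name count_chars is made up for illustration, and it assumes the calling shell is already in a UTF-8 locale so that grep -o . splits on whole characters and not on bytes. Only sort and uniq are switched to the C locale; the final sort is switched too so that ties are ordered bytewise.

```shell
#!/bin/sh
# count_chars FILE (hypothetical helper name): count how often each
# character occurs in the second whitespace-separated field of FILE.
#
# awk and grep keep the caller's locale, so "." matches one character.
# sort and uniq run with LC_ALL=C, so lines are compared byte by byte
# and distinct characters are never treated as equal.
count_chars() {
    awk '{print $2}' "$1" \
        | grep -o . \
        | LC_ALL=C sort \
        | LC_ALL=C uniq -c \
        | LC_ALL=C sort -n
}
```

With the file from the question this would be called as count_chars sample. Setting LC_ALL=C per command, not with export, leaves the rest of the session in its normal locale.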