I'm trying to determine all the characters in a file. The sample file consists of:
a eɪ
abandon əˈbændən
ability əˈbɪləti
able ˈeɪbəl
able ˈeɪbl
abortion əˈbɔrʃən
abortion əˈbɔrʃn
about əˈbaʊt
above əˈbʌv
abroad əˈbrɔd
I confirmed that the locale is correct:
$ echo $LANG
en_US.UTF-8
A command to take the second field, split it into characters, and count how many of each:
$ cat sample | awk '{print $2}' | grep -o . | sort | uniq -c | sort -n
1 a
1 æ
1 i
1 v
2 d
2 t
3 e
3 l
3 ɔ
3 r
4 n
9 b
11 ə
17 ɪ
Where are ʃ and ˈ? They aren't combining characters or anything special. Note that other UTF-8 characters are pulled out: ɔ, ə and ɪ, for example.
BTW, using sed 's/\(.\)/\1\n/g' gives nearly the same results as grep -o ., except that it adds a line for '\n'.
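A quick sanity check (my own diagnostic sketch, assuming the file is named sample as above) is to count a character right after grep -o, before sort and uniq run; in a UTF-8 locale, a nonzero count shows grep -o is not the component dropping it:

```shell
# Recreate a two-line excerpt of the sample file from the question.
printf 'abortion əˈbɔrʃən\nabout əˈbaʊt\n' > sample

# Count ʃ in grep's per-character output, before sort/uniq touch it.
# '|| true' keeps the pipeline from returning nonzero when the count is 0.
awk '{print $2}' sample | grep -o . | grep -c 'ʃ' || true
```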
Is there something I'm missing? Does grep have a hidden UTF-8 option? In case it matters, I'm using Ubuntu 12.04.2 LTS.
Best Answer
The problem is that sort and uniq are using the locale's collation information. Switching the locale off for the two commands works: