Some UTF-8 characters not being recognized by grep or sed

grep locale sed unicode

I'm trying to list every character that occurs in a file.

The file sample consists of:

a eɪ
abandon əˈbændən
ability əˈbɪləti
able ˈeɪbəl
able ˈeɪbl
abortion əˈbɔrʃən
abortion əˈbɔrʃn
about əˈbaʊt
above əˈbʌv
abroad əˈbrɔd

I confirmed that the locale is correct:

$ echo $LANG

en_US.UTF-8

This command takes the second field, splits it into characters, and counts how often each one occurs:

$ cat sample | awk '{print $2}' | grep -o . | sort | uniq -c | sort -n

  1 a
  1 æ
  1 i
  1 v
  2 d
  2 t
  3 e
  3 l
  3 ɔ
  3 r
  4 n
  9 b
 11 ə
 17 ɪ

Where are ʃ and ˈ? They aren't combining characters or anything special. Note that other multi-byte UTF-8 characters are picked up: ɔ, ə and ɪ, for example.

By the way, sed 's/\(.\)/\1\n/g' gives nearly the same results as grep -o ., except that it also produces an extra empty line for the '\n' at the end of each input line.

Is there something I'm missing? Does grep have a hidden UTF-8 option?

In case it matters I'm using Ubuntu 12.04.2 LTS.

Best Answer

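The problem is that sort and uniq use the locale's collation rules, and in this locale several of these characters evidently compare as equal. After sorting they end up adjacent and uniq merges them into a single entry: the 17 reported for ɪ is really 4 ɪ + 9 ˈ + 2 ʃ + 1 ʊ + 1 ʌ. Switching the locale off for the two commands works: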

cat sample | awk '{print $2}' | grep -o . | LC_ALL=C sort | LC_ALL=C uniq -c | sort -n
      1 ʊ
      1 ʌ
      1 a
      1 æ
      1 i
      1 v
      2 ʃ
      2 d
      2 t
      3 e
      3 l
      3 ɔ
      3 r
      4 ɪ
      4 n
      9 ˈ
      9 b
     11 ə
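
The fix can be wrapped in a small helper so that the byte-wise locale is applied to both commands together. This is only a sketch: count_lines is a hypothetical name, not something from the answer, and it assumes the input has already been split into one character per line (for instance by grep -o . running in a UTF-8 locale).

```shell
# count_lines (hypothetical helper): count identical input lines.
# LC_ALL=C makes sort and uniq compare raw bytes, so two different
# characters can never be merged because the locale collates them as equal.
count_lines() {
    LC_ALL=C sort | LC_ALL=C uniq -c | sort -n
}

# Small demonstration on three pre-split lines:
# reports a count of 1 for ˈ and 2 for ʃ.
printf 'ʃ\nˈ\nʃ\n' | count_lines
```

With the file from the question it would be used as: awk '{print $2}' sample | grep -o . | count_lines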

Related Solutions

UTF-8 – Can Not Use `cut -c` with UTF-8 Characters?

You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's that one. In that case, note this passage from info coreutils 'cut invocation':

‘-c character-list’
‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.

(emphasis added to the last sentence)

For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.
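
To make the byte-oriented behaviour concrete, here is a small sketch. The string əˈb is just an illustrative value, and the example assumes the text is UTF-8 encoded, where ə and ˈ each occupy two bytes. It deliberately sticks to -b and wc -c, whose byte semantics should not depend on which cut implementation is installed.

```shell
# "əˈb" is 3 characters but 5 bytes in UTF-8: ə and ˈ take 2 bytes each.
s='əˈb'

# wc -c counts bytes, whatever the locale.
printf '%s' "$s" | wc -c

# Bytes 1-2 are exactly the two bytes of ə, so the character comes out intact.
printf '%s\n' "$s" | LC_ALL=C cut -b1-2

# Bytes 1-3 stop in the middle of ˈ; dump the result as hex to see the
# stray lead byte (cb) that is left over before the newline (0a).
printf '%s\n' "$s" | LC_ALL=C cut -b1-3 | od -An -tx1
```

A cut whose -c is really -b will split characters in the same way as the last command whenever a position falls inside a multi-byte sequence.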

POSIX requires support for both the -b and -c options. GNU cut did not add them because it had working multi-byte support, but so that POSIX-compliant invocations would not produce errors. Some other cut implementations treat -c the same way, although at least FreeBSD's and OS X's do not.

This is the historic behaviour of -c. -b was newly added to take over the byte role so that -c can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. There are potential compatibility problems with old scripts, which may be a concern, although I don't know definitively what the reason is.

Grep Command – Search and Replace Using Grep Instead of SED

grep is only meant to print the lines matching a pattern, and initially that was all it did. That's what the name means: it comes from the ed command g/re/p.

Since then, some grep implementations have added a few features that encroach a bit on the role of other commands. For instance, some support -r, --include and --exclude, which perform part of find's job.

GNU grep added a -o option that makes it perform part of sed's job, as it makes grep edit the lines being matched.
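
As a small illustration of -o doing sed-like work, the sketch below extracts a run of digits from a line with both tools. The input string is made up for the example, and -o is an extension rather than a POSIX option, so this is not portable to every grep.

```shell
line='abc123def'

# grep -o prints only the part of the line that matched the pattern.
printf '%s\n' "$line" | grep -o '[0-9][0-9]*'

# A sed equivalent: keep only the first run of digits on the line.
printf '%s\n' "$line" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p'
```

Both commands print 123 for this input.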

pcregrep extended it with -o1, -o2... to print what was matched by capture groups. So with that implementation, even though it was not designed for that, you can actually replace:

sed 's/old/new/'

with:

pcregrep --om-separator=new -o1 -o2 '(.*?)old(.*)'

However, that doesn't work properly if one of the capture groups matches the empty string. On an input like:

XoldY
Xold
oldY

it gives:

XnewY
X
Y

You could work around that using even nastier tricks like:

PCREGREP_COLOR= pcregrep --color=always '.*old.*' |
  pcregrep --om-separator=new -o1 -o2 '^..(.+?)old(.+)..' |
  pcregrep -o1 '.(.*).'

That is, prepend and append \e[m (coloring escape sequence) to all matching lines to be sure there is at least one character on either side of old, and strip them afterwards.
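
For reference, the plain sed command that all of this emulates handles the same three input lines directly, including the cases where nothing precedes or follows old. A minimal check:

```shell
# Feed the three sample lines to the sed command that pcregrep is imitating.
printf '%s\n' XoldY Xold oldY | sed 's/old/new/'
# Output:
# XnewY
# Xnew
# newY
```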
