UTF-8 – Can Not Use `cut -c` with UTF-8 Characters?

character encodingcuttext processingunicode

The command cut has an option -c to work on characters, instead of bytes with the option -b. But that does not seem to work, in en_US.UTF-8 locale:

The second byte gives the second ASCII character (which is encoded just the same in UTF-8):

$ printf 'ABC' | cut -b 2          
B

but does not give the second of three greek non-ASCII characters in UTF-8 locale:

$ printf 'αβγ' | cut -b 2         
�

That's alright – it's the second byte.
So we look at the second character instead:

$ printf 'αβγ' | cut -c 2 
�

That looks broken.
With some experiments, it turns out that the range 3-4 shows the second character:

$ printf 'αβγ' | cut -c 3-4
β

But that's just the same as the bytes 3 to 4:

$ printf 'αβγ' | cut -b 3-4
β

So the -c does not more than the -b for UTF-8.

I'd expect the locale setup is not right for UTF-8, but in comparison, wc works as expected;
It is often used to count bytes, with option -c (--bytes).
^{(Note the confusing option names.)}

$ printf 'αβγ' | wc -c
6

But it can also count characters with option -m (--chars), which just works:

$ printf 'αβγ' | wc -m
3

So my configuration seems to be ok – but something is special about cut.

Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters, otherwise it would not need to support -b and -c.

So, what's wrong? And why?

The locale setup looks right for utf8, as far as I can tell:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

The input, byte by byte:

$ printf 'αβγ' | hd 
00000000  ce b1 ce b2 ce b3                                 |......|
00000006

Best Answer

You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's that one. In that case, note this passage from info coreutils 'cut invocation':

‘-c character-list’
‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.

(emphasis added)

For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.

Supporting both the -b and -c options is required by POSIX — they weren't added to GNU cut because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant input. The same -c has been done in some other cut implementations, although not FreeBSD's and OS X's at least.

This is the historic behaviour of -c. -b was newly added to take over the byte role so that -c can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. There are potential compatibility problems with old scripts, which may be a concern, although I don't know definitively what the reason is.

Related Solutions

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.

Some UTF-8 characters not being recognized by grep or sed

The problem is that sort and uniq are using collation information for the locale. Switching the locale off for the two commands works:

cat sample | awk '{print $2}' | grep -o . | LC_ALL=C sort | LC_ALL=C uniq -c | sort -n
      1 ʊ
      1 ʌ
      1 a
      1 æ
      1 i
      1 v
      2 ʃ
      2 d
      2 t
      3 e
      3 l
      3 ɔ
      3 r
      4 ɪ
      4 n
      9 ˈ
      9 b
     11 ə

Best Answer

Related Solutions

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

Some UTF-8 characters not being recognized by grep or sed

Related Question