UTF-8 – Cannot Use `cut -c` with UTF-8 Characters?

character encoding, cut, text processing, unicode

The cut command has an option -c that selects characters, as opposed to -b, which selects bytes. But -c does not seem to work in an en_US.UTF-8 locale.

The second byte gives the second ASCII character (which is encoded just the same in UTF-8):

$ printf 'ABC' | cut -b 2          
B

but it does not give the second of three Greek (non-ASCII) characters in a UTF-8 locale:

$ printf 'αβγ' | cut -b 2         
�

That's alright – it's the second byte.
So we look at the second character instead:

$ printf 'αβγ' | cut -c 2 
�

That looks broken.
With some experiments, it turns out that the range 3-4 shows the second character:

$ printf 'αβγ' | cut -c 3-4
β

But that's just the same as the bytes 3 to 4:

$ printf 'αβγ' | cut -b 3-4
β

So -c does nothing more than -b for UTF-8.

I suspected that the locale setup was not right for UTF-8, but wc, in comparison, works as expected.
It is often used to count bytes, with option -c (--bytes).
(Note the confusing option names.)

$ printf 'αβγ' | wc -c
6

But it can also count characters with option -m (--chars), which just works:

$ printf 'αβγ' | wc -m
3

So my configuration seems to be ok – but something is special about cut.

Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters; otherwise it would not need both -b and -c.

So, what's wrong? And why?

The locale setup looks right for UTF-8, as far as I can tell:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

The input, byte by byte:

$ printf 'αβγ' | hd 
00000000  ce b1 ce b2 ce b3                                 |......|
00000006

Best Answer

You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's that one. In that case, note this passage from info coreutils 'cut invocation':

‘-c character-list’
‘--characters=character-list’

Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.

(emphasis added)

For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.
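
To see why the byte and character results coincide, it may help to look at the input at the byte level. The following is only a small sketch, assuming od, tr and wc are available. It relies on one fact about UTF-8: continuation bytes always lie in the range 0x80-0xBF (octal 200-277), so every other byte starts a new character.

```shell
# Dump the input as hex bytes: each Greek letter is a lead byte (ce)
# followed by one continuation byte (b1, b2, b3).
printf 'αβγ' | od -An -tx1

# Bytes 3-4 are exactly the two bytes of the second letter, which is why
# a byte-based cut only shows it for the range 3-4.
printf 'αβγ' | cut -b 3-4 | od -An -tx1

# Deleting the continuation bytes and counting what is left gives the
# number of characters, independent of the locale (prints 3 here).
printf 'αβγ' | LC_ALL=C tr -d '\200-\277' | wc -c
```

A character-aware cut would have to do this kind of decoding before counting positions, and GNU cut currently does not.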


Supporting both the -b and -c options is required by POSIX. They weren't added to GNU cut because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant input. Some other cut implementations treat -c the same way, although not FreeBSD's and OS X's at least.

This is the historic behaviour of -c. The -b option was newly added to take over the byte role so that -c can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. Old scripts could run into compatibility problems, which may be a concern, although I don't know definitively what the reason is.