The command cut has an option -c to work on characters, as opposed to the option -b, which works on bytes. But that does not seem to work in the en_US.UTF-8 locale:
The second byte gives the second ASCII character (which is encoded just the same in UTF-8):
$ printf 'ABC' | cut -b 2
B
but does not give the second of three Greek (non-ASCII) characters in a UTF-8 locale:
$ printf 'αβγ' | cut -b 2
�
That's alright – it's the second byte.
So we look at the second character instead:
$ printf 'αβγ' | cut -c 2
�
That looks broken.
With some experiments, it turns out that the range 3-4 shows the second character:
$ printf 'αβγ' | cut -c 3-4
β
But that's just the same as the bytes 3 to 4:
$ printf 'αβγ' | cut -b 3-4
β
So -c does no more than -b for UTF-8.
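The byte ranges found by experiment follow from the encoding: each of these three Greek letters occupies exactly two bytes in UTF-8, so character n sits at bytes 2n-1 to 2n. A minimal sketch, assuming every character in the input is two bytes wide (true for αβγ, not for text that mixes ASCII and Greek); the function name nth_two_byte_char is made up for illustration:

```shell
# Hypothetical helper: print the n-th character of stdin, assuming every
# character is encoded as exactly two bytes (as for the Greek letters here).
nth_two_byte_char() {
    n=$1
    # Character n occupies bytes 2n-1 through 2n.
    cut -b "$((2 * n - 1))-$((2 * n))"
}

printf 'αβγ\n' | nth_two_byte_char 2   # prints β
```

This only restates the -b 3-4 experiment with the arithmetic made explicit; it is not a general replacement for a character-aware -c.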
I'd suspect that the locale setup is not right for UTF-8, but in comparison, wc works as expected. It is often used to count bytes, with option -c (--bytes). (Note the confusing option names.)
$ printf 'αβγ' | wc -c
6
But it can also count characters with option -m (--chars), which just works:
$ printf 'αβγ' | wc -m
3
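That the result of wc -m depends on the locale can be checked by overriding the locale for a single command. A small sketch, assuming GNU wc (or another implementation that treats each byte as one character in the C locale); the helper name count_chars_in is made up for illustration:

```shell
# Hypothetical helper: count the characters on stdin under a given locale.
count_chars_in() {
    LC_ALL=$1 wc -m
}

printf 'αβγ' | count_chars_in C             # 6 with GNU wc: one "character" per byte
printf 'αβγ' | count_chars_in en_US.UTF-8   # 3, provided that locale is installed
```

If the two counts differ, wc is honouring the locale, which is the behaviour one would also hope for from cut -c.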
So my configuration seems to be OK, but something is special about cut. Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters, otherwise it would not need to support both -b and -c.
So, what's wrong? And why?
The locale setup looks right for UTF-8, as far as I can tell:
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
The input, byte by byte:
$ printf 'αβγ' | hd
00000000 ce b1 ce b2 ce b3 |......|
00000006
Best Answer
You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's that one. In that case, note this passage from info coreutils 'cut invocation': (emphasis added)
For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.
Supporting both the -b and -c options is required by POSIX. They weren't added to GNU cut because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant input. The same treatment of -c is found in some other cut implementations, although not FreeBSD's and OS X's at least.
This is the historic behaviour of -c. -b was newly added to take over the byte role so that -c can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. There are potential compatibility problems with old scripts, which may be a concern, although I don't know definitively what the reason is.
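Until cut -c handles multi-byte characters, one workaround that does not depend on the locale is to convert the input to a fixed-width encoding, select by byte offset, and convert back. A rough sketch, assuming an iconv that supports UTF-32BE and head/tail implementations that accept -c; the function name nth_char is made up for illustration, and "character" here means a single code point, so combining sequences would be split:

```shell
# Hypothetical helper: print the n-th code point of UTF-8 text on stdin.
# In UTF-32BE every code point is exactly 4 bytes, so code point n
# occupies bytes 4(n-1)+1 through 4n.
nth_char() {
    n=$1
    iconv -f UTF-8 -t UTF-32BE |
        tail -c "+$((4 * (n - 1) + 1))" |   # skip the first n-1 code points
        head -c 4 |                         # keep exactly one code point
        iconv -f UTF-32BE -t UTF-8
}

printf 'αβγ' | nth_char 2   # prints β (without a trailing newline)
```

Other tools, such as some awk implementations or a shell's own substring expansion, may also be character-aware in a UTF-8 locale, but that depends on the implementation and is worth testing before relying on it.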