cat has a -v option which converts non-printing characters to their caret notation (which is useful if we don't want the terminal to interpret the control characters literally in cat output).
But as I understand it, the caret notation only applies to non-printing characters in the ASCII range. So what about the non-printing characters in Unicode that do not fall in ASCII (e.g., https://www.compart.com/en/unicode/category/Cc)? What notation will cat -v use to display these?
Best Answer
We can generate a file containing the first 256 Unicode characters in UTF-8 with:
That includes the non-ASCII (C1) controls in Latin-1 Supplement, and also plenty of printing characters.
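As a minimal sketch of one way to build such a file (not necessarily the command used for the output discussed here), assuming Python 3 is available. The names first_256_utf8, write_sample and first256.txt are made up for illustration:

```python
def first_256_utf8() -> bytes:
    """Return code points U+0000..U+00FF encoded as UTF-8."""
    return "".join(chr(cp) for cp in range(256)).encode("utf-8")


def write_sample(path: str = "first256.txt") -> None:
    """Write the sample bytes to a file (hypothetical file name)."""
    with open(path, "wb") as f:  # binary mode: no newline translation
        f.write(first_256_utf8())


data = first_256_utf8()
# U+0000..U+007F take one byte each, U+0080..U+00FF take two bytes each.
print(len(data))               # 384
print(data[128:130].hex(" "))  # c2 80  (U+0080)
```

Calling write_sample() and then running cat -v first256.txt should let you reproduce the kind of output described next.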
Now we can cat -v it:
(I've wrapped that manually so that it's readable)
You can see that it represents U+0080 at the start of the fourth line, which is UTF-8 C2 80, as M-BM-^@. M-B represents the C2 byte: B is 0x42, so M- represents setting the high bit (i.e. adding 0x80). M-^@ is doing the same for a null byte (meta-ctrl-@): the M-x and ^x notation is combined together.
The same thing will happen for all non-ASCII codepoints, which will consist entirely of high bytes in UTF-8, or all bytes 128-255 in any other encoding. Different cat implementations may have their own behaviour, as -v is not a standard cat option, but both GNU cat and the common BSD versions behave this way.
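To make the byte-by-byte rule concrete, here is a rough Python sketch of the notation as described above. It is an illustration, not the actual cat source; cat_v is a made-up name, and it assumes tab and newline are passed through unchanged, as plain -v does in GNU cat:

```python
def cat_v(data: bytes) -> str:
    """Approximate the notation `cat -v` uses, one byte at a time."""
    out = []
    for b in data:
        if b in (0x09, 0x0A):      # assumption: TAB and LF pass through
            out.append(chr(b))
            continue
        if b >= 0x80:              # high bit set: emit "M-" and clear the bit
            out.append("M-")
            b -= 0x80
        if b < 0x20:               # control character: caret notation
            out.append("^" + chr(b + 0x40))
        elif b == 0x7F:            # DEL
            out.append("^?")
        else:                      # printable ASCII
            out.append(chr(b))
    return "".join(out)


print(cat_v("\u0080".encode("utf-8")))  # M-BM-^@
print(cat_v("é".encode("utf-8")))       # M-CM-)
print(cat_v("é".encode("latin-1")))     # M-i
```

The last two lines show the same character in two encodings: the output depends only on the bytes, because the notation knows nothing about Unicode.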