Cat – Using `cat -v` for Non-Printing Non-ASCII UTF Characters

cat control-characters unicode

cat has a -v option which converts non-printing characters to their caret notation (which is useful if we don't want the terminal to interpret the control characters literally in cat output).

But as I understand it, caret notation only applies to the non-printing characters in ASCII. So what about the non-printing characters in Unicode that fall outside ASCII (e.g., https://www.compart.com/en/unicode/category/Cc)? What notation will cat -v use to display these?

Best Answer

We can generate a file containing the first 255 Unicode characters (U+0000 through U+00FE; range(0,255) stops one short of U+00FF) in UTF-8 with:

python3 -c 'for x in range(0,255): print(chr(x), end="")' > unicode-file

That includes the non-ASCII (C1) controls in Latin-1 Supplement, and also plenty of printing characters.
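As a quick sanity check on which code points in that range are the non-ASCII controls, the sketch below asks Python's unicodedata module for every code point from U+0080 to U+00FF whose general category is Cc. The helper name control_code_points is made up for this illustration.

```python
import unicodedata


def control_code_points(start: int, stop: int) -> list[int]:
    """Return the code points in [start, stop) whose general category is Cc."""
    return [cp for cp in range(start, stop)
            if unicodedata.category(chr(cp)) == "Cc"]


# Latin-1 Supplement block: U+0080 through U+00FF.
c1_controls = control_code_points(0x80, 0x100)
print(f"{len(c1_controls)} controls: "
      f"U+{c1_controls[0]:04X}..U+{c1_controls[-1]:04X}")
```

This should report the 32 C1 controls, U+0080 through U+009F; everything from U+00A0 upward in the block is a printing character or a formatting character such as the soft hyphen.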

Now we can cat -v it:

^@^A^B^C^D^E^F^G^H
^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^^^_ !"#$%&'()*+,-./0123456789:;
<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~^?
M-BM-^@M-BM-^AM-BM-^BM-BM-^CM-BM-^DM-BM-^EM-BM-^FM-BM-^GM-BM-^HM-BM-^I
M-BM-^JM-BM-^KM-BM-^LM-BM-^MM-BM-^NM-BM-^OM-BM-^PM-BM-^QM-BM-^RM-BM-^S
M-BM-^TM-BM-^UM-BM-^VM-BM-^WM-BM-^XM-BM-^YM-BM-^ZM-BM-^[M-BM-^\M-BM-^]
M-BM-^^M-BM-^_M-BM- M-BM-!M-BM-"M-BM-#M-BM-$M-BM-%M-BM-&M-BM-'M-BM-(M-B
M-)M-BM-*M-BM-+M-BM-,M-BM--M-BM-.M-BM-/M-BM-0M-BM-1M-BM-2M-BM-3M-BM-4M-B
M-5M-BM-6M-BM-7M-BM-8M-BM-9M-BM-:M-BM-;M-BM-<M-BM-=M-BM->M-BM-?M-CM-^@
M-CM-^AM-CM-^BM-CM-^CM-CM-^DM-CM-^EM-CM-^FM-CM-^GM-CM-^HM-CM-^IM-CM-^J
M-CM-^KM-CM-^LM-CM-^MM-CM-^NM-CM-^OM-CM-^PM-CM-^QM-CM-^RM-CM-^SM-CM-^T
M-CM-^UM-CM-^VM-CM-^WM-CM-^XM-CM-^YM-CM-^ZM-CM-^[M-CM-^\M-CM-^]M-CM-^^
M-CM-^_M-CM- M-CM-!M-CM-"M-CM-#M-CM-$M-CM-%M-CM-&M-CM-'M-CM-(M-CM-)M-C
M-*M-CM-+M-CM-,M-CM--M-CM-.M-CM-/M-CM-0M-CM-1M-CM-2M-CM-3M-CM-4M-CM-5M-C
M-6M-CM-7M-CM-8M-CM-9M-CM-:M-CM-;M-CM-<M-CM-=M-CM->

(I've wrapped that manually so that it's readable)

You can see that U+0080, which is C2 80 in UTF-8, appears at the start of the fourth line as M-BM-^@. M-B represents the C2 byte: B is 0x42, and M- means the high bit is set (i.e. 0x80 is added), giving 0xC2. M-^@ does the same for the 80 byte: ^@ is the null byte 0x00, and M- adds 0x80 to it (meta-ctrl-@). The M-x and ^x notations are simply combined.
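The byte-by-byte mapping described here can be sketched in Python. This is an illustration of the notation as described in this answer, not a port of the GNU or BSD source; the function name cat_v is made up, and it assumes that tab and newline pass through unchanged, as they do with plain -v.

```python
def cat_v(data: bytes) -> str:
    """Render bytes in the ^x / M-x notation described above (a sketch)."""
    out = []
    for b in data:
        if b in (0x09, 0x0A):
            # Assumption: tab and newline are passed through, as with plain -v.
            out.append(chr(b))
        elif b < 0x20:
            out.append("^" + chr(b + 0x40))        # control byte -> ^@ .. ^_
        elif b < 0x7F:
            out.append(chr(b))                     # printable ASCII
        elif b == 0x7F:
            out.append("^?")                       # DEL
        else:
            low = b - 0x80                         # clear the high bit
            if low < 0x20:
                out.append("M-^" + chr(low + 0x40))
            elif low < 0x7F:
                out.append("M-" + chr(low))
            else:
                out.append("M-^?")
    return "".join(out)


print(cat_v("\u0080".encode("utf-8")))   # U+0080 is the byte pair C2 80
print(cat_v("\u00e9".encode("utf-8")))   # U+00E9 is the byte pair C3 A9
```

The two calls print M-BM-^@ and M-CM-), matching the corresponding entries in the output above. Because the function works on bytes rather than characters, it never needs to know which encoding the input is in.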

The same thing happens for every non-ASCII code point, since its UTF-8 encoding consists entirely of high bytes (0x80 and above), and for any byte in the range 128-255 in any other encoding. Different cat implementations may behave differently, as -v is not a standard cat option, but both GNU cat and the common BSD versions behave this way.
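The claim about high bytes is easy to check for a few sample characters. The sketch below uses a made-up helper, encoded_bytes, to list the bytes of each sample in UTF-8, and then shows the single high byte that U+00E9 becomes in Latin-1 for comparison. The sample characters are arbitrary choices.

```python
def encoded_bytes(ch: str, encoding: str) -> list[int]:
    """Return the byte values of ch in the given encoding."""
    return list(ch.encode(encoding))


# Arbitrary non-ASCII samples: a C1 control, e-acute, the euro sign, an emoji.
for ch in ["\u0080", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8 = encoded_bytes(ch, "utf-8")
    # Every byte of a non-ASCII code point's UTF-8 encoding has the high bit set.
    assert all(b >= 0x80 for b in utf8)
    print(f"U+{ord(ch):04X}:", " ".join(f"{b:02X}" for b in utf8))

# The same character in a single-byte encoding is one high byte.
print("latin-1:", " ".join(f"{b:02X}" for b in encoded_bytes("\u00e9", "latin-1")))
```

Since every one of those bytes is 0x80 or above, each is shown with an M- prefix, so a single character turns into two to four M- groups in UTF-8 but only one in a single-byte encoding such as Latin-1.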

