Zsh – Creating Strings with Invalid Unicode Characters

unicode zsh

For testing purposes, I need a string that contains invalid Unicode characters. How can I create such a string in Zsh?

Best Answer

I assume you mean UTF-8 encoded Unicode characters.

That depends on what you mean by invalid.

invalid_byte_sequence=$'\x80\x81'

That's a sequence of bytes that, by itself, isn't valid in the UTF-8 encoding (the first byte of a multi-byte UTF-8 character always has the two highest bits set, and neither 0x80 nor 0x81 does). That sequence could be seen in the middle of a character though, so it could end up forming a valid sequence once concatenated to another invalid sequence like $'\xe1'. $'\xe1' or $'\xe1\x80' by themselves would also be invalid and could be seen as a truncated character.
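To make the concatenation point concrete, here is a minimal sketch that uses iconv as a validity check. The helper names is_valid_utf8 and check are mine, not part of any tool, and the sketch assumes an iconv that exits with a non-zero status when it cannot decode its input (GNU iconv does).

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
# Assumes iconv exits non-zero on undecodable or truncated input.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: print a label and the verdict for one string.
check() {
  if printf %s "$2" | is_valid_utf8; then
    echo "$1: valid"
  else
    echo "$1: invalid"
  fi
}

lead_byte=$'\xe1'         # start of a 3-byte character, truncated
continuation=$'\x80\x81'  # two continuation bytes with no lead byte

check lead "$lead_byte"
check continuation "$continuation"
check joined "$lead_byte$continuation"   # E1 80 81 is the encoding of U+1001
```

I would expect the first two checks to report invalid and the last one to report valid, since the three bytes together form a complete character.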

other_invalid_byte_sequence=$'\xc2\xc2'

The 0xc2 byte would start a 2-byte character, and 0xc2 cannot be in the middle of a UTF-8 character. So that sequence can never be found in valid UTF-8 text. The same goes for $'\xc0' or $'\xc1', which are bytes that never appear in the UTF-8 encoding (they could only start an overlong 2-byte encoding of an ASCII character).
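The reason 0xc0 and 0xc1 never appear can be checked with a little shell arithmetic. This is only a sketch of the 2-byte bit layout (110xxxxx 10xxxxxx); the function name two_byte_lead is made up for the illustration.

```shell
# Lead byte of a 2-byte sequence: 110xxxxx, where xxxxx holds the
# code point shifted right by 6 bits.
two_byte_lead() {
  printf '%02x\n' $(( 0xC0 | ($1 >> 6) ))
}

two_byte_lead 0x80    # smallest code point that needs 2 bytes: c2
two_byte_lead 0x7FF   # largest code point that fits in 2 bytes: df
two_byte_lead 0x7F    # ASCII, so a 2-byte form would be overlong: c1
two_byte_lead 0x3F    # ASCII, so a 2-byte form would be overlong: c0
```

Every legitimate 2-byte character therefore has a lead byte between 0xc2 and 0xdf.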

For the \uXXXX and \UXXXXXXXX sequences, I assume the current locale's encoding is UTF-8.

non_character=$'\ufffe'

That's one of the 66 currently specified non-characters.
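The figure of 66 comes from a block of 32 code points (U+FDD0 to U+FDEF) plus the last two code points of each of the 17 planes. A rough sketch of a membership test follows; is_noncharacter is a name invented here and takes a code point as a number such as 0xFFFE.

```shell
# Hypothetical helper: succeeds if the given code point is a noncharacter.
is_noncharacter() {
  local cp=$(( $1 ))
  # U+FDD0..U+FDEF: a contiguous block of 32 noncharacters
  if [ "$cp" -ge $(( 0xFDD0 )) ] && [ "$cp" -le $(( 0xFDEF )) ]; then
    return 0
  fi
  # U+nFFFE and U+nFFFF: the last two code points of each plane
  [ $(( cp & 0xFFFE )) -eq $(( 0xFFFE )) ] && [ "$cp" -le $(( 0x10FFFF )) ]
}

is_noncharacter 0xFFFE && echo "U+FFFE is a noncharacter"
is_noncharacter 0xFFFD || echo "U+FFFD is not a noncharacter"

# 32 in the block, plus 2 per plane over 17 planes
echo $(( 32 + 2 * 17 ))
```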

not_valid_anymore=$'\U110000'

Unicode is now restricted to code points up to 0x10FFFF. And the UTF-8 encoding which was originally designed to cover up to 0x7FFFFFFF (perl also supports a variant that goes to 0xFFFFFFFFFFFFFFFF) is now conventionally restricted to that as well.

utf16_surrogate=$'\ud800'

Code points 0xD800 to 0xDFFF are reserved for the UTF-16 encoding (as surrogates). So the UTF-8 encoding of those code points is invalid.
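If the locale is not UTF-8 and the \u escapes cannot be relied on, the same bytes can be worked out by hand. The sketch below applies the generic UTF-8 bit layout with no validity checks at all, which is exactly why it happily "encodes" a surrogate or a code point past 0x10FFFF. The function utf8_bytes is hypothetical, and it only handles the 1- to 4-byte forms.

```shell
# Hypothetical helper: print the bytes the generic UTF-8 bit layout
# gives for a code point. It does not reject surrogates or values
# above 0x10FFFF, and only covers sequences of up to 4 bytes.
utf8_bytes() {
  local cp=$(( $1 ))
  if [ "$cp" -lt 128 ]; then
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then
    printf '%02x %02x\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then
    printf '%02x %02x %02x\n' $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  else
    printf '%02x %02x %02x %02x\n' $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_bytes 0xD800     # ed a0 80
utf8_bytes 0x110000   # f4 90 80 80
utf8_bytes 0x10FFFF   # f4 8f bf bf (the last valid code point)
```

The resulting bytes can then be written as $'\xed\xa0\x80' or $'\xf4\x90\x80\x80' regardless of the locale.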

Now most of the remaining code points are still not assigned in the latest version of Unicode.

unassigned=$'\u378'

Newer versions of Unicode come with new characters specified. For instance, Unicode 8.0 (released in June 2015) has 🤗 (U+1F917), which was not assigned in earlier versions.

unicode_8_and_above_only=$'\U1f917'

Some testing with uconv:

$ printf %s $invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: 80 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: 81 Error: Illegal character found
$ printf %s $other_invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: c2 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: c2 Error: Truncated character found
$ printf %s $non_character| uconv -x any-name
\N{<noncharacter-FFFE>}
$ printf %s $not_valid_anymore| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: f4 90 80 80 Error: Illegal character found
$ printf %s $utf16_surrogate | uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: ed a0 80 Error: Illegal character found
$ printf %s $unassigned | uconv -x any-name
\N{<unassigned-0378>}
$ printf %s $unicode_8_and_above_only | uconv -x any-name
\N{<unassigned-1F917>}
$

With GNU grep, you can use grep . to see if it can find a character in the input:

l=(invalid_byte_sequence other_invalid_byte_sequence non_character
  not_valid_anymore utf16_surrogate unassigned unicode_8_and_above_only)
for c ($l) print -r ${(P)c} | grep -q . && print $c

Which for me gives:

non_character
not_valid_anymore
utf16_surrogate
unassigned
unicode_8_and_above_only

That is, my grep still considers some of those invalid, non-characters or not-assigned-yet characters as being (or containing) characters. YMMV for other implementations of grep or other utilities.

Related Solutions

How to print Unicode glyph names for input string

The uniutils package has the program uniname.

$ echo -n …—|uniname
character  byte       UTF-32   encoded as     glyph   name
    0          0  002026   E2 80 A6       …      HORIZONTAL ELLIPSIS
    1          3  002014   E2 80 94       —      EM DASH

How to type arbitrary unicode characters in xterm

xterm doesn't implement a hexadecimal-input feature because all of the text editors which handle UTF-8 provide their own equivalents (emacs, vim and vile, of course, even nano). This could be useful in a shell script, but is not often mentioned. The feature was first implemented in Windows, of course.

To enter multibyte (e.g., UTF-8) characters in xterm, you would use compose sequences. As a special case, the meta key can be used as a sort of shift to get the 128-255 coverage of UTF-8, but aside from that, compose is what works.

gnome-terminal (more properly VTE), also implements compose, although there are some differences.
