Zsh – Creating Strings with Invalid Unicode Characters

unicode zsh

For testing purposes, I need a string containing invalid Unicode characters. How can I create such a string in Zsh?

Best Answer

I assume you mean UTF-8 encoded Unicode characters.

That depends on what you mean by invalid.

invalid_byte_sequence=$'\x80\x81'

That's a sequence of bytes that, by itself, isn't valid in the UTF-8 encoding: 0x80 and 0x81 are continuation bytes, whereas the first byte of a multi-byte UTF-8 character always has its two highest bits set (and a single-byte character has its highest bit clear). That sequence could be seen in the middle of a character though, so it could end up forming a valid sequence once appended to another invalid sequence like $'\xe1'. $'\xe1' or $'\xe1\x80' by themselves would also be invalid and could be seen as a truncated character.
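A quick way to check which bytes a string really contains is to dump them in hexadecimal. The sketch below assumes a POSIX od and tr are available; the helper name hex and the variable combined are made up for this illustration.

```shell
invalid_byte_sequence=$'\x80\x81'

# hex STRING: print the bytes of STRING as contiguous lowercase hex digits.
# (hypothetical helper, not part of zsh)
hex() {
  printf %s "$1" | od -An -tx1 | tr -d ' \n'
}

# Two continuation bytes on their own: not a valid UTF-8 sequence.
hex "$invalid_byte_sequence"; echo    # 8081

# Prefixed with the (by itself invalid) lead byte 0xe1, the same two bytes
# complete a well-formed 3-byte sequence, the encoding of U+1001.
combined=$'\xe1'$invalid_byte_sequence
hex "$combined"; echo                 # e18081
```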

other_invalid_byte_sequence=$'\xc2\xc2'

The 0xc2 byte starts a 2-byte character, so it must be followed by a continuation byte (0x80 to 0xbf), and 0xc2 is not one: it cannot appear in the middle of a UTF-8 character. So that sequence can never be found in valid UTF-8 text. The same goes for $'\xc0' or $'\xc1', which are bytes that never appear in the UTF-8 encoding.
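The byte ranges mentioned so far can be summed up in a small classifier. This is only a sketch: classify_byte is a hypothetical helper that takes a byte as two hex digits and looks at that byte alone, so it says nothing about whether the bytes that follow a lead byte are acceptable.

```shell
# classify_byte HEX: report what role a single byte can play in UTF-8.
# (hypothetical helper; HEX is two hex digits such as c2)
classify_byte() {
  local b=$(( 0x$1 ))
  if   [ "$b" -lt 128 ]; then echo ascii          # 0x00-0x7f: single-byte character
  elif [ "$b" -lt 192 ]; then echo continuation   # 0x80-0xbf: only after a lead byte
  elif [ "$b" -le 193 ] || [ "$b" -gt 244 ]; then
    echo never-valid                              # 0xc0, 0xc1, 0xf5-0xff
  else echo lead                                  # 0xc2-0xf4: starts a 2- to 4-byte character
  fi
}

classify_byte 80   # continuation
classify_byte c2   # lead
classify_byte c0   # never-valid
```

Here 0xf5 to 0xff are counted as never valid because they would only start sequences beyond U+10FFFF under the current restriction.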

For the \uXXXX and \UXXXXXXXX sequences, I assume the current locale's encoding is UTF-8.

non_character=$'\ufffe'

That's one of the 66 currently specified non-characters.
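For reference, the 66 non-characters are U+FDD0 to U+FDEF (32 code points) plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF). A minimal sketch of a test for them, where is_noncharacter is a hypothetical helper name:

```shell
# is_noncharacter CODEPOINT: succeed if CODEPOINT (e.g. 0xfffe) is one of
# the 66 Unicode non-characters. (hypothetical helper)
is_noncharacter() {
  local cp=$(( $1 ))
  # The contiguous block U+FDD0..U+FDEF.
  if [ "$cp" -ge $(( 0xFDD0 )) ] && [ "$cp" -le $(( 0xFDEF )) ]; then
    return 0
  fi
  # Within the Unicode range, any code point whose low 16 bits are
  # FFFE or FFFF.
  [ "$cp" -le $(( 0x10FFFF )) ] && [ $(( cp & 0xFFFE )) -eq $(( 0xFFFE )) ]
}

is_noncharacter 0xfffe && echo "U+FFFE is a non-character"
is_noncharacter 0xfffd || echo "U+FFFD is not"
```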

not_valid_anymore=$'\U110000'

Unicode is now restricted to code points up to 0x10FFFF. And the UTF-8 encoding which was originally designed to cover up to 0x7FFFFFFF (perl also supports a variant that goes to 0xFFFFFFFFFFFFFFFF) is now conventionally restricted to that as well.

utf16_surrogate=$'\ud800'

Code points 0xD800 to 0xDFFF are reserved as surrogates for the UTF-16 encoding, so the UTF-8 encoding of those code points is invalid.
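If the locale cannot be assumed to be UTF-8, the bytes can also be worked out by hand and written with \x escapes. The following sketch applies the UTF-8 bit layout mechanically, without checking that the code point is an acceptable one, which is exactly why it is happy to "encode" 0x110000 or a surrogate. utf8_hex is a hypothetical helper, and it only handles the 1- to 4-byte forms.

```shell
# utf8_hex CODEPOINT: print the bytes (in hex) that the UTF-8 bit layout
# yields for CODEPOINT, with no validity check. (hypothetical helper)
utf8_hex() {
  local cp=$(( $1 ))
  if [ "$cp" -lt $(( 0x80 )) ]; then          # 1 byte:  0xxxxxxx
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt $(( 0x800 )) ]; then       # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02x %02x\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt $(( 0x10000 )) ]; then     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else                                        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_hex 0xfffe     # ef bf be
utf8_hex 0x110000   # f4 90 80 80
utf8_hex 0xd800     # ed a0 80
```

So $'\xf4\x90\x80\x80' and $'\xed\xa0\x80' should produce the same byte strings as $'\U110000' and $'\ud800' do in a UTF-8 locale.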

Now most of the remaining code points are still not assigned in the latest version of Unicode.

unassigned=$'\u378'

Newer versions of Unicode come with new characters specified. For instance Unicode 8.0 (released in June 2015) has 🤗 (U+1F917, HUGGING FACE) which was not assigned in earlier versions.

unicode_8_and_above_only=$'\U1f917'

Some testing with uconv:

$ printf %s $invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: 80 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: 81 Error: Illegal character found
$ printf %s $other_invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: c2 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: c2 Error: Truncated character found
$ printf %s $non_character| uconv -x any-name
\N{<noncharacter-FFFE>}
$ printf %s $not_valid_anymore| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: f4 90 80 80 Error: Illegal character found
$ printf %s $utf16_surrogate | uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: ed a0 80 Error: Illegal character found
$ printf %s $unassigned | uconv -x any-name
\N{<unassigned-0378>}
$ printf %s $unicode_8_and_above_only | uconv -x any-name
\N{<unassigned-1F917>}
$
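Where uconv is not installed, iconv can give a rough yes/no answer, since it should fail when asked to decode malformed UTF-8. This is only a sketch, assuming an iconv command is available; is_valid_utf8 is a hypothetical helper, and how strictly surrogates or out-of-range code points are treated may differ between iconv implementations.

```shell
# is_valid_utf8 STRING: succeed if iconv accepts STRING as UTF-8 input.
# (hypothetical helper; relies on iconv's exit status)
is_valid_utf8() {
  printf %s "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

for s in $'\x80\x81' $'\xc2\xc2' $'\xe1\x80\x81'; do
  if is_valid_utf8 "$s"; then
    echo "accepted"
  else
    echo "rejected"
  fi
done
```

Like uconv above, such a check only looks at the encoding; it does not tell whether a code point is assigned.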

With GNU grep, you can use grep . to see if it can find a character in the input:

l=(invalid_byte_sequence other_invalid_byte_sequence non_character
  not_valid_anymore utf16_surrogate unassigned unicode_8_and_above_only)
for c ($l) print -r ${(P)c} | grep -q . && print $c

Which for me gives:

non_character
not_valid_anymore
utf16_surrogate
unassigned
unicode_8_and_above_only

That is, my grep still considers some of those invalid, non-characters or not-assigned-yet characters as being (or containing) characters. YMMV for other implementations of grep or other utilities.
