Zsh – Creating Strings with Invalid Unicode Characters


For testing purposes I need a string containing invalid Unicode characters. How can I create such a string in Zsh?

Best Answer

I assume you mean UTF-8 encoded Unicode characters.

It depends on what you mean by invalid.


Take $invalid_byte_sequence, the bytes 0x80 0x81 in the uconv test below. By itself that sequence isn't valid UTF-8: the first byte of a multi-byte UTF-8 character always has its two highest bits set, while 0x80 and 0x81 are continuation bytes. Such a sequence can occur in the middle of a character, though, so it could end up forming a valid sequence once appended to another invalid sequence like $'\xe1'. $'\xe1' or $'\xe1\x80' on their own would also be invalid and could be seen as a truncated character.
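As a rough sketch of the point about 0x80 0x81 (the bytes uconv reports for $invalid_byte_sequence further down), the snippet below builds the sequence and dumps it as hex. It uses octal printf escapes so that it also runs outside zsh; the helper name hexbytes is made up for this illustration.

```shell
# Bytes 0x80 0x81 written as octal escapes, which any POSIX printf understands.
# In zsh, invalid_byte_sequence=$'\x80\x81' should yield the same two bytes.
invalid_byte_sequence=$(printf '\200\201')

# Hypothetical helper: print the bytes of its argument as hex digits.
hexbytes() {
  printf %s "$1" | od -An -tx1 | tr -d ' \n'
}

hexbytes "$invalid_byte_sequence"; echo                  # 8081
# Prefixing the lone lead byte 0xe1 gives e1 80 81, the UTF-8 encoding of U+1001.
hexbytes "$(printf '\341')$invalid_byte_sequence"; echo  # e18081
```

Neither half is valid on its own; only the concatenation decodes as a character.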


In $other_invalid_byte_sequence (0xc2 0xc2), the 0xc2 byte would start a 2-byte character, and 0xc2 cannot appear in the middle of a UTF-8 character, so that sequence can never be found in valid UTF-8 text. The same goes for $'\xc0' or $'\xc1', bytes that never appear in the UTF-8 encoding.
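To make the lead-byte versus continuation-byte distinction concrete, here is a small sketch that classifies a single byte by its value. The function name utf8_byte_role and its labels are hypothetical, and the ranges assume UTF-8 as restricted to U+10FFFF.

```shell
# Classify one byte, given as two hex digits, by the role it can play in UTF-8.
# utf8_byte_role is a made-up helper name for this illustration.
utf8_byte_role() {
  b=$((0x$1))
  if   [ "$b" -le 127 ]; then echo ascii          # 0xxxxxxx
  elif [ "$b" -le 191 ]; then echo continuation   # 10xxxxxx
  elif [ "$b" -le 193 ]; then echo never-valid    # 0xc0, 0xc1: could only start overlong forms
  elif [ "$b" -le 223 ]; then echo lead-of-2      # 110xxxxx
  elif [ "$b" -le 239 ]; then echo lead-of-3      # 1110xxxx
  elif [ "$b" -le 244 ]; then echo lead-of-4      # 11110xxx, up to 0xf4
  else                        echo never-valid    # 0xf5 to 0xff
  fi
}

utf8_byte_role c2   # lead-of-2
utf8_byte_role 80   # continuation
utf8_byte_role c0   # never-valid
```

A sequence like 0xc2 0xc2 fails because the second byte is again a lead byte where a continuation byte is required.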

For the \uXXXX and \UXXXXXXXX sequences, I assume the current locale's encoding is UTF-8.


$non_character holds one of the 66 currently specified non-characters.


$not_valid_anymore encodes U+110000 (bytes f4 90 80 80). Unicode is now restricted to code points up to 0x10FFFF, and the UTF-8 encoding, which was originally designed to cover up to 0x7FFFFFFF (perl also supports a variant that goes up to 0xFFFFFFFFFFFFFFFF), is now conventionally restricted to that range as well.


Code points 0xD800 to 0xDFFF are reserved for UTF-16 surrogates, so the UTF-8 encoding of any of them, such as the ed a0 80 (U+D800) in $utf16_surrogate, is invalid.
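The byte sequences that uconv reports below for these two cases (f4 90 80 80 and ed a0 80) can be derived by hand from the UTF-8 bit layout. The following is a minimal sketch, assuming only shell arithmetic; utf8_layout is a hypothetical helper, it deliberately performs no validity checks, and it only handles code points up to 0x1FFFFF.

```shell
# Print, as hex, the bytes the UTF-8 bit layout gives for a code point.
# No validity checks on purpose: surrogates and values above 0x10FFFF are encoded too.
utf8_layout() {
  cp=$(($1))
  if [ "$cp" -lt 128 ]; then            # 1 byte:  0xxxxxxx
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02x %02x\n' \
      $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
  elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x\n' \
      $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
  else                                  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
      $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
  fi
}

utf8_layout 0x110000   # f4 90 80 80  (first code point past the Unicode range)
utf8_layout 0xD800     # ed a0 80     (first UTF-16 surrogate)
```

To turn such a hex listing into an actual string, each byte can be passed to printf as an octal escape, or written as $'\xf4\x90\x80\x80' in zsh.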

Most of the remaining code points are still not assigned in the latest version of Unicode; $unassigned holds one of those.


Newer versions of Unicode add new characters. For instance, Unicode 8.0 (released in June 2015) has 🤗 (U+1F917), which was not assigned in earlier versions; that is the character in $unicode_8_and_above_only.


Some testing with uconv:

$ printf %s $invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: 80 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: 81 Error: Illegal character found
$ printf %s $other_invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: c2 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: c2 Error: Truncated character found
$ printf %s $non_character| uconv -x any-name
$ printf %s $not_valid_anymore| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: f4 90 80 80 Error: Illegal character found
$ printf %s $utf16_surrogate | uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: ed a0 80 Error: Illegal character found
$ printf %s $unassigned | uconv -x any-name
$ printf %s $unicode_8_and_above_only | uconv -x any-name

With GNU grep, you can use grep . to see if it can find a character in the input:

l=(invalid_byte_sequence other_invalid_byte_sequence non_character
  not_valid_anymore utf16_surrogate unassigned unicode_8_and_above_only)
for c ($l) print -r ${(P)c} | grep -q . && print $c

Which for me gives:


That is, my grep still considers some of those invalid, non-characters or not-assigned-yet characters as being (or containing) characters. YMMV for other implementations of grep or other utilities.
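As one more data point from another utility, a round trip through iconv can serve as a rough validity check where uconv is not installed. This is only a sketch: is_valid_utf8 is a hypothetical helper name, and how iconv treats non-characters, surrogates or out-of-range code points varies between implementations, so only the clear-cut cases are shown.

```shell
# Succeeds (exit status 0) only if iconv can decode the argument as UTF-8.
# is_valid_utf8 is a made-up helper name for this illustration.
is_valid_utf8() {
  printf %s "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

is_valid_utf8 "$(printf '\200\201')" || echo "80 81: rejected"
is_valid_utf8 "abc" && echo "abc: accepted"
```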
