For some testing purposes I need a string with invalid Unicode characters. How can I create such a string in Zsh?
Zsh – Creating Strings with Invalid Unicode Characters
Best Answer
I assume you mean UTF-8 encoded Unicode characters.
That depends what you mean by invalid.
A sequence made only of continuation bytes, such as $'\x80\x81', isn't valid in UTF-8 encoding by itself (the first byte of a multi-byte UTF-8 character always has its two highest bits set). Such a sequence could be seen in the middle of a character though, so it could end up forming a valid sequence once appended to another invalid sequence like $'\xe1'. $'\xe1' or $'\xe1\x80' themselves would also be invalid and could be seen as a truncated character.
In a sequence such as $'\xc2\xc2', the 0xc2 byte would start a 2-byte character, and 0xc2 cannot be in the middle of a UTF-8 character. So that sequence can never be found in valid UTF-8 text. Same for $'\xc0' or $'\xc1', which are bytes that never appear in the UTF-8 encoding.
For the \uXXXX and \UXXXXXXXX sequences, I assume the current locale's encoding is UTF-8.
A code point such as U+FFFE ($'\ufffe') is one of the 66 currently specified non-characters.
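The byte sequences discussed above can be built with $'...' quoting and inspected as hex. This is only a rough sketch: the variable names and the hex helper are made up for illustration, and od and tr are assumed to be available. Byte escapes are used throughout so the result does not depend on the locale.

```shell
# hex: hypothetical helper that prints the bytes of its argument as hex digits
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

lone_continuation=$'\x80\x81'   # continuation bytes with no lead byte
truncated=$'\xe1\x80'           # lead byte of a 3-byte character, last byte missing
double_lead=$'\xc2\xc2'         # lead byte followed by another lead byte
never_used=$'\xc0'              # byte that never appears in UTF-8
non_character=$'\xef\xbf\xbe'   # UTF-8 encoding of the non-character U+FFFE

# Concatenating two invalid pieces can yield a well-formed character:
# 0xe1 0x80 0x81 is the 3-byte encoding of U+1001.
joined=$'\xe1'$lone_continuation

for name in lone_continuation truncated double_lead never_used non_character joined; do
  eval "value=\$$name"          # indirect lookup that works in both zsh and bash
  printf '%s: %s\n' "$name" "$(hex "$value")"
done
```

In a UTF-8 locale, $'\ufffe' should give the same bytes as the non_character line; the explicit bytes just avoid relying on that.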
Unicode is now restricted to code points up to 0x10FFFF. And the UTF-8 encoding, which was originally designed to cover up to 0x7FFFFFFF (perl also supports a variant that goes to 0xFFFFFFFFFFFFFFFF), is now conventionally restricted to that as well.
Code points 0xD800 to 0xDFFF are reserved for the UTF-16 encoding. So the UTF-8 encoding of those code points is invalid.
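To see which bytes those out-of-range and surrogate code points would map to, here is a small sketch that applies the UTF-8 bit layout blindly. utf8_hex is a hypothetical helper, not a standard tool, and the variable names are illustrative. It only covers the 1 to 4 byte forms and deliberately does no validity checking.

```shell
# utf8_hex: hypothetical helper. Prints, in hex, the bytes that the UTF-8 bit
# layout yields for a code point, with no validity check (so it also
# "encodes" surrogates and values above 0x10FFFF). Uses a global variable cp.
utf8_hex() {
  cp=$(( $1 ))
  if [ "$cp" -lt 128 ]; then                       # below 0x80: 1 byte
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then                    # below 0x800: 2 bytes
    printf '%02x%02x\n' \
      $(( 0xc0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3f) ))
  elif [ "$cp" -lt 65536 ]; then                   # below 0x10000: 3 bytes
    printf '%02x%02x%02x\n' \
      $(( 0xe0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3f) )) \
      $(( 0x80 | (cp & 0x3f) ))
  elif [ "$cp" -lt 2097152 ]; then                 # below 0x200000: 4 bytes
    printf '%02x%02x%02x%02x\n' \
      $(( 0xf0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3f) )) \
      $(( 0x80 | ((cp >> 6) & 0x3f) )) $(( 0x80 | (cp & 0x3f) ))
  else
    echo "utf8_hex: only 1 to 4 byte forms are handled" >&2
    return 1
  fi
}

utf8_hex 0x10ffff   # last code point Unicode allows
utf8_hex 0x110000   # first one past the limit
utf8_hex 0xd800     # first code point reserved for UTF-16

# The same bytes written as strings, independent of the locale:
not_valid_anymore=$'\xf4\x90\x80\x80'   # what U+110000 would encode to
utf16_surrogate=$'\xed\xa0\x80'         # what U+D800 would encode to
```

In a UTF-8 locale I would expect zsh to produce these same bytes for $'\U110000' and $'\ud800', but that is an assumption worth checking on your version.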
Now most of the remaining code points are still not assigned in the latest version of Unicode.
Newer versions of Unicode come with new characters specified. For instance Unicode 8.0 (released in June 2015) has 🤗 (U+1F917) which was not assigned in earlier versions.
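Unassigned code points differ from the earlier cases in that their UTF-8 encoding is perfectly well-formed. A minimal sketch, with illustrative variable names, and assuming U+0378 is still unassigned (it was as far as I know at the time of writing):

```shell
# Both strings are well-formed UTF-8; what differs is whether the code point
# is assigned in a given Unicode version. Variable names are illustrative.
unassigned=$'\xcd\xb8'              # U+0378, i.e. $'\u378' in a UTF-8 locale
hugging_face=$'\xf0\x9f\xa4\x97'    # U+1F917, assigned since Unicode 8.0

for s in "$unassigned" "$hugging_face"; do
  printf '%s' "$s" | od -An -tx1    # dump the bytes of each string in hex
done
```

Whether a tool treats the first string as a character depends on the Unicode version its character database was built from.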
These strings can be tested with uconv. With GNU grep, you can use grep . to see if it can find a character in the input.
For me, GNU grep still considers some of those invalid sequences, non-characters or not-yet-assigned characters as being (or containing) characters. YMMV for other implementations of grep or other utilities.
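The grep test can be sketched as a loop like the one below. has_char is a hypothetical helper, the list of strings is just the byte sequences used as examples here, and the verdicts depend on your grep implementation and locale, so no expected output is shown.

```shell
# has_char: hypothetical helper; succeeds if grep finds at least one
# character (as grep defines it in the current locale) in the string.
has_char() { printf '%s\n' "$1" | grep -q .; }

for s in $'\x80\x81' $'\xc2\xc2' $'\xef\xbf\xbe' \
         $'\xf4\x90\x80\x80' $'\xed\xa0\x80' $'\xcd\xb8'; do
  if has_char "$s"; then
    verdict='contains a character'
  else
    verdict='no character found'
  fi
  # Show the bytes in hex next to the verdict
  printf '%s: %s\n' "$(printf '%s' "$s" | od -An -tx1 | tr -d ' \n')" "$verdict"
done
```

Run it under a UTF-8 locale for the comparison to be meaningful; in the C locale grep treats every byte as a character.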