My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
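To sketch what that line-ending conversion looks like in the same pipeline (the file name, its contents and the pattern here are made-up placeholders; a real UTF-16LE file with DOS line endings would be used instead):

```shell
# Fabricate a small UTF-16LE sample with CRLF line endings for demonstration.
printf 'hello pattern here\r\nother line\r\n' | iconv -f UTF-8 -t UTF-16LE > myfile.txt

# Convert to the local UTF-8, strip the carriage returns, then grep natively.
iconv -f UTF-16LE -t UTF-8 myfile.txt | tr -d '\r' | grep pattern
```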
I've seen that before. A bug. Try:
--- tr.c	6 Sep 2005 23:04:11 -0000	1.10
+++ tr.c	30 May 2014 09:46:33 -0000
@@ -291,7 +291,6 @@
 			if(c<ccnt) code[c] = d;
 			if(d<ccnt && sflag) squeez[d] = 1;
 		}
-		free(vect);
 		while((d = next(&string2)) != NIL) {
 			if(sflag) squeez[d] = 1;
 			if(string2.max==NIL && (string2.p==NULL || *string2.p==0))
(That was from a quick look a few months ago; while this patch should get you going, I can't guarantee it's right. Apply with patch -l.)
Now also note that /dev/urandom provides a stream of random bytes. In UTF-8, not all sequences of bytes map to valid characters. For instance, 0x41 0x81 0x41 is not valid because 0x81 is >= 0x80, and such a byte can only occur inside a multi-byte sequence of two or more bytes that are all >= 0x80. An invalid byte like that, because it's not in the set of characters that is the complement of ☠, will not be deleted by tr.
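To see that invalidity concretely, one can feed the 0x41 0x81 0x41 sequence from above to a strict UTF-8 decoder (iconv here, as a sketch; the exact error message depends on the iconv implementation):

```shell
# 0x41 0x81 0x41 as octal escapes; a strict UTF-8 decoder rejects the stray
# continuation byte 0x81 with an "illegal input sequence" style error.
printf '\101\201\101' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
  && echo 'valid UTF-8' || echo 'invalid UTF-8'
```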
Better would probably be:
recode ucs-2..u8 < /dev/urandom | tr -cd ☠
With ucs-2 being the characters U+0000 to U+FFFF encoded as 2 bytes per character, /dev/urandom looks more like a stream of ucs-2 characters (though we're missing the characters U+10000 to U+10FFFF).
But that would still include the D800..DFFF surrogate range, which mbrtowc(3) will choke on (at least with my version of libc). Those code points are reserved for the purposes of the UTF-16 encoding: D800 DC00, for instance, is the UTF-16BE encoding of U+10000, but there is no U+D800 or U+DC00 character. The UTF-8 encodings of those don't make sense as characters either (even when adjacent).
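As an illustration of that choking (using iconv as a stand-in for mbrtowc; a sketch, since the exact behavior is implementation-dependent), ED A0 80 is what the UTF-8 scheme would yield for the surrogate U+D800, and a conforming decoder refuses it:

```shell
# ED A0 80 (octal 355 240 200) is the would-be UTF-8 form of the surrogate
# U+D800; conforming UTF-8 decoders reject it as an illegal sequence.
printf '\355\240\200' | iconv -f UTF-8 -t UTF-16BE >/dev/null 2>&1 \
  && echo accepted || echo rejected
```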
So you'd need to exclude them first:
perl -ne 'BEGIN{$/=\2;binmode STDOUT,":utf8"}
$c = unpack("n",$_); if ($c < 0xd800 || $c > 0xdfff) {
no warnings "utf8"; print chr($c)
}' < /dev/urandom | tr -cd ☠
If the point is to get a stream of random Unicode characters encoded in UTF-8, best would probably be to pick a random code point in the allowable range (0..0xD7FF, 0xE000..0x10FFFF) and convert that to UTF-8. If you want to base it on /dev/urandom, you could use 3 bytes (24 bits) of it for each code point:
perl -ne 'BEGIN{$/=\3;binmode STDOUT,":utf8"}
$c = unpack("N","\0$_") * 0x10F800 >> 24;
$c+=0x800 if $c >= 0xd800;
do {no warnings "utf8"; print chr($c)}' < /dev/urandom |
tr -cd ☠
Best Answer
Short: setlocale, wcrtomb and wcsrtombs.