Unicode – Heirloom Toolchest tr Error with Multibyte Characters

tr unicode

I'm trying to use the tr command from the Heirloom Toolchest to work around a limitation of the coreutils implementation (which operates on bytes rather than multibyte characters), so that I can "pump" (with the -dc options) multibyte characters from a "random" generator (/dev/urandom) to the terminal. Note that I compiled it from source on ArchBang, after failing to build it from the AUR version(s).

To simplify this, let's pick a character (☠) and work out its octal byte values, because that is how it must be expressed for the Toolchest tr:

echo '☠' | hexdump -b            # -b for octal
0000000 342 230 240 012                                                
0000004
echo -e '\0342\0230\0240'        # uses the "0nnn" format, make sure it prints
☠

Note that the octal value is written differently for the Bash echo builtin (\0nnn) than for the Toolchest tr (\nnn), whose manual says:

The character '\' followed by 1, 2 or 3 octal digits stands for the
character whose byte code is given by those digits. Multibyte
characters can be specified as a sequence of octal bytes.
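
As a rough sketch of the conversion described above, a small shell helper can turn any string into the \nnn form. The function name to_tr_octal is made up for this illustration, and the sketch assumes an od that accepts -An, -v and -to1 (as the coreutils one does):

```shell
# Hypothetical helper (not part of the Toolchest): print each byte of the
# argument as a backslash followed by its 3-digit octal value.
to_tr_octal() {
    # od prints the bytes as space-separated octal numbers; the unquoted
    # command substitution splits them into one word per byte.
    for byte in $(printf '%s' "$1" | od -An -v -to1); do
        printf '\\%s' "$byte"
    done
    printf '\n'
}

to_tr_octal 'A'    # prints \101
```

Called with '☠' in a UTF-8 environment, it should print \342\230\240, matching the hexdump output shown earlier.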

Let's try it. The -dc options together delete the complement of SET1: you specify a single set, and every character from standard input that is not in that set gets discarded:

echo '012345' | /usr/5bin/tr -dc '456'   #sanity check
45                                       #all good

Now these:

echo -e '\0342\0230\0240' | /usr/5bin/tr -dc '\342\230\240'
echo -e '☠' | /usr/5bin/tr -dc '☠'

Both of these should print a single ☠. Instead, they and the following command (which should ultimately print many more characters) all produce the same error:

/usr/5bin/tr -dc '\342\230\240' < /dev/urandom

*** Error in `/usr/5bin/tr': double free or corruption (!prev): 0x0000000000d24420 ***

In fact, the error appears with -dc every time the input and SET1 both contain the chosen character. The behavior is the same across the SysV 3rd, 4th, Posix, Posix2001 and ucb (BSD) versions of the command provided in the toolchest. Sometimes, as with tr -dc '1' < /dev/urandom, I get a proper segfault, or a few lines of output followed by this:

*** Error in `/usr/5bin/tr': realloc(): invalid pointer: 0x00007f93ee284010 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x73f8e)[0x7f93ee338f8e]
/usr/lib/libc.so.6(+0x7988e)[0x7f93ee33e88e]
/usr/lib/libc.so.6(realloc+0x1c8)[0x7f93ee342918]
/usr/5bin/tr[0x401a74]
/usr/5bin/tr[0x400e93]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f93ee2e5000]
/usr/5bin/tr[0x400f63]
======= Memory map: ========
00400000-00403000 r-xp 00000000 08:21 1579535                            /usr/5bin/tr
00602000-00603000 rw-p 00002000 08:21 1579535                            /usr/5bin/tr
0067a000-006bc000 rw-p 00000000 00:00 0                                  [heap]
7f93edc6e000-7f93edc84000 r-xp 00000000 08:21 1448153                    /usr/lib/libgcc_s.so.1
7f93edc84000-7f93ede83000 ---p 00016000 08:21 1448153                    /usr/lib/libgcc_s.so.1
7f93ede83000-7f93ede84000 rw-p 00015000 08:21 1448153                    /usr/lib/libgcc_s.so.1
7f93ede84000-7f93ee2c5000 rw-p 00000000 00:00 0 
7f93ee2c5000-7f93ee469000 r-xp 00000000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee469000-7f93ee669000 ---p 001a4000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee669000-7f93ee66d000 r--p 001a4000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee66d000-7f93ee66f000 rw-p 001a8000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee66f000-7f93ee673000 rw-p 00000000 00:00 0 
7f93ee673000-7f93ee694000 r-xp 00000000 08:21 1440340                    /usr/lib/ld-2.19.so
7f93ee6eb000-7f93ee874000 r--p 00000000 08:21 1448356                    /usr/lib/locale/locale-archive
7f93ee874000-7f93ee877000 rw-p 00000000 00:00 0 
7f93ee891000-7f93ee893000 rw-p 00000000 00:00 0 
7f93ee893000-7f93ee894000 r--p 00020000 08:21 1440340                    /usr/lib/ld-2.19.so
7f93ee894000-7f93ee895000 rw-p 00021000 08:21 1440340                    /usr/lib/ld-2.19.so
7f93ee895000-7f93ee896000 rw-p 00000000 00:00 0 
7fffed79c000-7fffed7bd000 rw-p 00000000 00:00 0                          [stack]
7fffed7e9000-7fffed7eb000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Does all of this suggest a compilation error on my part, or am I not using the command properly?


With the contributed patch we have:

echo -e '\0342\0230\0240' | /home/me/bin/trsc -dc '\342\230\240'
echo -e '☠' | /home/me/bin/trsc -dc '☠'
☠

As we should!! But:

/home/me/bin/trsc -dc '\342\230\240' < /dev/urandom

remains a mystery, as the chosen character never shows up in the output…

Best Answer

I've seen that before. A bug. Try:

--- tr.c        6 Sep 2005 23:04:11 -0000       1.10
+++ tr.c        30 May 2014 09:46:33 -0000
@@ -291,7 +291,6 @@
                if(c<ccnt) code[c] = d;
                if(d<ccnt && sflag) squeez[d] = 1;
        }
-       free(vect);
        while((d = next(&string2)) != NIL) {
                if(sflag) squeez[d] = 1;
                if(string2.max==NIL && (string2.p==NULL || *string2.p==0))

(That was from a quick look a few months ago. While this patch will get you going, I can't guarantee it's right. Apply with patch -l.)

Now also note that /dev/urandom provides a stream of bytes. In UTF-8, not all sequences of bytes map to valid characters. For instance, 0x41 0x81 0x41 is not valid: 0x81 is >= 0x80, so it can only occur as part of a sequence of 2 or more bytes that are all >= 0x80, and here it sits alone between two ASCII bytes.

An invalid byte is not a character, so it is not in the set of characters that is the complement of ☠, and tr will not delete it.
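
To illustrate why a lone 0x81 cannot be a character, here is a rough sketch that classifies a single byte by the role it can play in UTF-8. The function name utf8_byte_class is made up for this illustration, and it only looks at one byte in isolation; it does not validate whole sequences:

```shell
# Hypothetical helper: name the role a byte value can play in UTF-8.
# Accepts decimal or 0x-prefixed hexadecimal input.
utf8_byte_class() {
    b=$(( $1 ))
    if   [ "$b" -lt 128 ]; then echo ascii          # 0x00..0x7F
    elif [ "$b" -lt 192 ]; then echo continuation   # 0x80..0xBF
    elif [ "$b" -lt 194 ]; then echo invalid        # 0xC0, 0xC1 (overlong)
    elif [ "$b" -lt 224 ]; then echo lead2          # 0xC2..0xDF
    elif [ "$b" -lt 240 ]; then echo lead3          # 0xE0..0xEF
    elif [ "$b" -lt 245 ]; then echo lead4          # 0xF0..0xF4
    else                        echo invalid        # 0xF5..0xFF
    fi
}

utf8_byte_class 0x41   # prints ascii
utf8_byte_class 0x81   # prints continuation
```

In 0x41 0x81 0x41 the continuation byte has no lead byte before it, whereas in ☠ (0xE2 0x98 0xA0) a lead3 byte is followed by exactly two continuation bytes.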

Better would probably be:

recode ucs-2..u8 < /dev/urandom | tr -cd ☠

UCS-2 encodes the characters U+0000 to U+FFFF on 2 bytes per character, so the output of /dev/urandom looks more like a stream of UCS-2 characters (though we're missing the characters U+10000 to U+10FFFF).

But that would still include the D800..DFFF surrogate pair range which mbrtowc(3) will choke on (at least with my version of libc).

Those code points are reserved for the purpose of UTF-16 encoding. d800dc00, for instance, is the UTF-16BE encoding of U+10000, but there's no U+D800 or U+DC00 character. The UTF-8 encodings of those don't make sense as characters either (even if adjacent).
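
The d800dc00 example can be checked with a little shell arithmetic. This is only a sketch of the standard UTF-16 surrogate computation; the function name utf16_surrogates is made up for this illustration, and it assumes its argument is a code point in U+10000..U+10FFFF:

```shell
# Hypothetical helper: print the UTF-16 surrogate pair (high, low) for a
# code point above U+FFFF. No range checking is done.
utf16_surrogates() {
    cp=$(( $1 - 0x10000 ))                  # 20-bit offset into the planes above the BMP
    high=$(( 0xD800 + (cp >> 10) ))         # top 10 bits go into the high surrogate
    low=$(( 0xDC00 + (cp & 0x3FF) ))        # bottom 10 bits go into the low surrogate
    printf '%04x %04x\n' "$high" "$low"
}

utf16_surrogates 0x10000   # prints d800 dc00
```

Every value in D800..DFFF is used up by this scheme, which is why none of them can also denote a character of its own.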

So you'd need to exclude them first:

perl -ne 'BEGIN{$/=\2;binmode STDOUT,":utf8"}
          $c = unpack("n",$_); if ($c < 0xd800 || $c > 0xdfff) {
            no warnings "utf8"; print chr($c)
          }' < /dev/urandom | tr -cd ☠

If the point is to get a stream of random Unicode characters encoded in UTF-8, the best approach would probably be to pick a random code point in the allowable range (0..0xd7ff, 0xe000..0x10ffff) and convert that to UTF-8. If you want to base it on /dev/urandom, you could use 3 bytes (24 bits) from it for each code point:

perl -ne 'BEGIN{$/=\3;binmode STDOUT,":utf8"}
          $c = unpack("N","\0$_") * 0x10F800 >> 24;
          $c+=0x800 if $c >= 0xd800;
          do {no warnings "utf8"; print chr($c)}' < /dev/urandom |
  tr -cd ☠
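
The arithmetic in that last Perl snippet can be traced in plain shell. This is only a sketch of the same mapping, assuming a shell with 64-bit arithmetic; the function name map24_to_codepoint is made up for this illustration. 0x10F800 is the number of non-surrogate code points (0x110000 minus the 0x800 surrogates), so a 24-bit value is first scaled into 0..0x10F7FF and then shifted past the D800..DFFF gap:

```shell
# Hypothetical helper: map a 24-bit value (0..0xFFFFFF) to a Unicode code
# point, skipping the surrogate range, as the Perl one-liner does.
map24_to_codepoint() {
    c=$(( ($1 * 0x10F800) >> 24 ))      # scale to 0..0x10F7FF
    if [ "$c" -ge $(( 0xD800 )) ]; then
        c=$(( c + 0x800 ))              # jump over D800..DFFF
    fi
    printf 'U+%04X\n' "$c"
}

map24_to_codepoint 0          # prints U+0000
map24_to_codepoint 0xFFFFFF   # prints U+10FFFF
```

The two calls show that the extremes of the 24-bit input land on the first and last code points.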