Unicode – Heirloom Toolchest tr Error with Multibyte Characters


I'm trying to use the tr command from the Heirloom Toolchest to work around a current limitation of the coreutils implementation, so that I can "pump" (with the -dc options) multibyte characters from a "random" generator (/dev/urandom) to the terminal. Note that I compiled the Toolchest from source on ArchBang after failing to do so with the AUR version(s).

To simplify, let's pick a character (☠) and work out its octal byte values, because that is how it must be expressed for the Toolchest tr:

echo '☠' | hexdump -b            # -b for octal
0000000 342 230 240 012                                                
0000004
echo -e '\0342\0230\0240'        # uses the "0nnn" format, make sure it prints
☠
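The same octal values can be cross-checked without hexdump. The following is a minimal Python sketch, assuming a UTF-8 encoding; the helper name utf8_octal_escapes is made up for illustration and is not part of any tool mentioned here.

```python
def utf8_octal_escapes(char: str) -> str:
    """Return the UTF-8 bytes of `char` as backslash-octal escapes (\\nnn)."""
    # Each byte is formatted as three octal digits, the form tr expects.
    return "".join(f"\\{byte:03o}" for byte in char.encode("utf-8"))


print(utf8_octal_escapes("☠"))  # \342\230\240
```

The result matches the three bytes reported by hexdump -b above; the trailing 012 in the hexdump output is just the newline added by echo.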

Note the difference in how an octal value is written for Bash's echo builtin (\0nnn) and for the Toolchest tr (\nnn). The tr manual says:

The character '\' followed by 1, 2 or 3 octal digits stands for the
character whose byte code is given by those digits. Multibyte
characters can be specified as a sequence of octal bytes.

Let's try it. The -dc options delete the complement of SET1: you specify a single set, and every character from standard input that is not an element of the set gets discarded:

echo '012345' | /usr/5bin/tr -dc '456'   #sanity check
45                                       #all good

Now these:

echo -e '\0342\0230\0240' | /usr/5bin/tr -dc '\342\230\240'
echo -e '☠' | /usr/5bin/tr -dc '☠'

Both should print a single ☠. Instead, these two commands, and ultimately the following one (which should print many more characters), all produce the same error:

/usr/5bin/tr -dc '\342\230\240' < /dev/urandom

*** Error in `/usr/5bin/tr': double free or corruption (!prev): 0x0000000000d24420 ***

In fact, the error appears with -dc every time the input and SET1 both contain the chosen character. The behavior is the same across the SysV 3rd, 4th, Posix, Posix2001, and ucb (BSD) versions of the command provided in the Toolchest. Sometimes, as with tr -dc '1' < /dev/urandom, I get a proper segfault, or a few lines of output followed by this:

Error in `/usr/5bin/tr': realloc(): invalid pointer: 0x00007f93ee284010 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x73f8e)[0x7f93ee338f8e]
/usr/lib/libc.so.6(+0x7988e)[0x7f93ee33e88e]
/usr/lib/libc.so.6(realloc+0x1c8)[0x7f93ee342918]
/usr/5bin/tr[0x401a74]
/usr/5bin/tr[0x400e93]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f93ee2e5000]
/usr/5bin/tr[0x400f63]
======= Memory map: ========
00400000-00403000 r-xp 00000000 08:21 1579535                            /usr/5bin/tr
00602000-00603000 rw-p 00002000 08:21 1579535                            /usr/5bin/tr
0067a000-006bc000 rw-p 00000000 00:00 0                                  [heap]
7f93edc6e000-7f93edc84000 r-xp 00000000 08:21 1448153                    /usr/lib/libgcc_s.so.1
7f93edc84000-7f93ede83000 ---p 00016000 08:21 1448153                    /usr/lib/libgcc_s.so.1
7f93ede83000-7f93ede84000 rw-p 00015000 08:21 1448153                    /usr/lib/libgcc_s.so.1
7f93ede84000-7f93ee2c5000 rw-p 00000000 00:00 0 
7f93ee2c5000-7f93ee469000 r-xp 00000000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee469000-7f93ee669000 ---p 001a4000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee669000-7f93ee66d000 r--p 001a4000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee66d000-7f93ee66f000 rw-p 001a8000 08:21 1440453                    /usr/lib/libc-2.19.so
7f93ee66f000-7f93ee673000 rw-p 00000000 00:00 0 
7f93ee673000-7f93ee694000 r-xp 00000000 08:21 1440340                    /usr/lib/ld-2.19.so
7f93ee6eb000-7f93ee874000 r--p 00000000 08:21 1448356                    /usr/lib/locale/locale-archive
7f93ee874000-7f93ee877000 rw-p 00000000 00:00 0 
7f93ee891000-7f93ee893000 rw-p 00000000 00:00 0 
7f93ee893000-7f93ee894000 r--p 00020000 08:21 1440340                    /usr/lib/ld-2.19.so
7f93ee894000-7f93ee895000 rw-p 00021000 08:21 1440340                    /usr/lib/ld-2.19.so
7f93ee895000-7f93ee896000 rw-p 00000000 00:00 0 
7fffed79c000-7fffed7bd000 rw-p 00000000 00:00 0                          [stack]
7fffed7e9000-7fffed7eb000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Does all this suggest a compilation error on my part, or am I not using the command properly?

With the contributed patch we have:

echo -e '\0342\0230\0240' | /home/me/bin/trsc -dc '\342\230\240'
echo -e '☠' | /home/me/bin/trsc -dc '☠'
☠

As we should!! But:

/home/me/bin/trsc -dc '\342\230\240' < /dev/urandom

remains a mystery, as the chosen character never appears in the output…

Best Answer

I've seen that before. A bug. Try:

--- tr.c        6 Sep 2005 23:04:11 -0000       1.10
+++ tr.c        30 May 2014 09:46:33 -0000
@@ -291,7 +291,6 @@
                if(c<ccnt) code[c] = d;
                if(d<ccnt && sflag) squeez[d] = 1;
        }
-       free(vect);
        while((d = next(&string2)) != NIL) {
                if(sflag) squeez[d] = 1;
                if(string2.max==NIL && (string2.p==NULL || *string2.p==0))

(That was from a quick look a few months ago. This patch will get you going, but I can't guarantee it's right. Apply with patch -l.)

Also note that /dev/urandom provides a stream of bytes. In UTF-8, not every sequence of bytes maps to valid characters. For instance, 0x41 0x81 0x41 is not valid: 0x81 is a continuation byte (0x80..0xBF), so it can only occur after a lead byte, inside a sequence of 2 or more bytes that are all >= 0x80.

An invalid byte is not a character, so it is not in the set of characters that is the complement of ☠ either, and tr will not delete it.
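The validity claim can be checked with a strict decoder. This is a small Python sketch, not what tr does internally; the helper name is_valid_utf8 is an illustrative assumption.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\x41\x81\x41"))  # False: 0x81 has no lead byte
print(is_valid_utf8(b"\xe2\x98\xa0"))  # True: the three bytes of ☠
```

Most short slices of /dev/urandom fail this check, which is why raw random bytes are a poor source of UTF-8 text.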

Better would probably be:

recode ucs-2..u8 < /dev/urandom | tr -cd ☠

Since ucs-2 is the set of characters U+0000 to U+FFFF encoded on 2 bytes per character, /dev/urandom looks more like a stream of ucs-2 characters (though we're missing the characters U+10000 to U+10FFFF).

But that would still include the D800..DFFF surrogate pair range which mbrtowc(3) will choke on (at least with my version of libc).

Those code points are reserved for UTF-16 encoding. d800dc00, for instance, is the UTF-16BE encoding of U+10000, but there is no U+D800 or U+DC00 character. Their UTF-8 encodings don't make sense as characters either (even when adjacent).
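Both points about the surrogate range can be illustrated in Python, whose strict UTF-8 codec also refuses lone surrogates. This is only a sketch; the helper name encodes_as_utf8 is made up for the example.

```python
def encodes_as_utf8(code_point: int) -> bool:
    """Return True if the code point can be encoded as strict UTF-8."""
    try:
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# d800dc00 read as UTF-16BE is a surrogate pair for one character, U+10000.
pair = bytes.fromhex("d800dc00").decode("utf-16-be")
print(len(pair), hex(ord(pair)))  # 1 0x10000

print(encodes_as_utf8(0xD800))  # False: a lone surrogate is not a character
print(encodes_as_utf8(0x2620))  # True: U+2620 is ☠
```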

So you'd need to exclude them first:

perl -ne 'BEGIN{$/=\2;binmode STDOUT,":utf8"}
          $c = unpack("n",$_); if ($c < 0xd800 || $c > 0xdfff) {
            no warnings "utf8"; print chr($c)
          }' < /dev/urandom | tr -cd ☠

If the point is to get a stream of random Unicode characters encoded in UTF-8, the best approach is probably to pick a random code point in the allowable range (0..0xd7ff, 0xe000..0x10ffff) and convert that to UTF-8. If you want to base it on /dev/urandom, you could use 3 bytes (24 bits) from it for each code point:

perl -ne 'BEGIN{$/=\3;binmode STDOUT,":utf8"}
          $c = unpack("N","\0$_") * 0x10F800 >> 24;
          $c+=0x800 if $c >= 0xd800;
          do {no warnings "utf8"; print chr($c)}' < /dev/urandom |
  tr -cd ☠
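For readers who find the Perl one-liner dense, here is the same arithmetic as a Python sketch. It assumes os.urandom as the byte source in place of reading /dev/urandom directly, and the names SCALAR_COUNT, code_point_from_3_bytes and random_utf8_char are illustrative only.

```python
import os

# Number of code points outside the surrogate range:
# 0x110000 total minus the 0x800 surrogates D800..DFFF.
SCALAR_COUNT = 0x110000 - 0x800  # 0x10F800


def code_point_from_3_bytes(raw: bytes) -> int:
    """Scale a 24-bit big-endian value onto 0..0xD7FF and 0xE000..0x10FFFF."""
    value = int.from_bytes(raw, "big")   # 0 .. 2**24 - 1
    cp = (value * SCALAR_COUNT) >> 24    # 0 .. 0x10F7FF
    if cp >= 0xD800:
        cp += 0x800                      # hop over the surrogate range
    return cp


def random_utf8_char() -> bytes:
    """Return one random non-surrogate code point encoded as UTF-8."""
    return chr(code_point_from_3_bytes(os.urandom(3))).encode("utf-8")


print(hex(code_point_from_3_bytes(b"\xff\xff\xff")))  # 0x10ffff
```

Because 2**24 is not a multiple of 0x10F800, the mapping is only approximately uniform, and the output still includes unassigned code points.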

TL;DR: Nope, the utf8 in en_US.utf8 is not a canonical character set name.

utf8 doesn't refer to an IANA character set since it drops the - character.
IANA character set names are case INsensitive.
Therefore, the following all refer to RFC3629: UTF-8, a transformation format of ISO 10646:
- UTF-8
- utf-8
- uTf-8
(Note that all of them have a hyphen.)
There is a case-sensitive alias of the above name: csUTF8

The details

POSIX.1-2017, section 8.2 Internationalization Variables

If the locale value has the form:
language[_territory][.codeset]
it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.

But while POSIX.1 leaves the details implementation defined, IANA has something to say about it.

RFC2978 IANA Charset Registration Procedures

2.3. Naming Requirements defines a character set primary name:

 mime-charset = 1*mime-charset-chars
 mime-charset-chars = ALPHA / DIGIT /
            "!" / "#" / "$" / "%" / "&" /
            "'" / "+" / "-" / "^" / "_" /
            "`" / "{" / "}" / "~"
 ALPHA        = "A".."Z"    ; Case insensitive ASCII Letter
 DIGIT        = "0".."9"    ; Numeric digit

Note the Case insensitive ASCII Letter.

Interestingly, this means that ^-^ is a happy but valid character set name.
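The grammar above translates directly into a regular expression. The Python sketch below checks syntax only, not whether a name is actually registered with IANA; MIME_CHARSET and is_mime_charset are names invented for this example.

```python
import re

# One or more characters from the mime-charset-chars set in RFC 2978, 2.3.
MIME_CHARSET = re.compile(r"[A-Za-z0-9!#$%&'+\-^_`{}~]+")


def is_mime_charset(name: str) -> bool:
    """Return True if `name` is syntactically a mime-charset token."""
    return MIME_CHARSET.fullmatch(name) is not None


print(is_mime_charset("^-^"))    # True
print(is_mime_charset("UTF-8"))  # True
print(is_mime_charset("UTF 8"))  # False: space is not an allowed character
```

Note that utf8 also passes this syntactic check; it is simply not a name that IANA lists.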

IANA Character Sets

These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation.

The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters. [emphasis mine]

IANA lists the character set as UTF-8.

While utf-8 (or uTf-8) is an official IANA character set name, utf8 (sans hyphen) is not.

Note that there is also a case-sensitive alias for the name UTF-8, namely: csUTF8.

The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").

If it's not IANA, where does `utf8` likely come from?

glibc's _nl_normalize_codeset() does the following:

- Passes only letters and digits through (goodbye, hyphen)
- Converts letters to lowercase

for (cnt = 0; cnt < name_len; ++cnt)
  if (__isalpha_l ((unsigned char) codeset[cnt], locale))
    *wp++ = __tolower_l ((unsigned char) codeset[cnt], locale);
  else if (__isdigit_l ((unsigned char) codeset[cnt], locale))
    *wp++ = codeset[cnt];

The code comment incorrectly says:

There is no standard for the codeset names.

This comment doesn't seem cognisant of RFC2978 IANA Charset Registration Procedures, 2.3. Naming Requirements.
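The two normalization steps described above for _nl_normalize_codeset() can be modelled in a few lines of Python. This is a rough sketch that assumes plain ASCII input and covers only those two steps; the real glibc function is locale-aware, and normalize_codeset is a name made up for the illustration.

```python
def normalize_codeset(codeset: str) -> str:
    """Keep ASCII letters (lowercased) and ASCII digits; drop the rest."""
    out = []
    for ch in codeset:
        if ch.isascii() and ch.isalpha():
            out.append(ch.lower())   # letters are folded to lowercase
        elif ch.isascii() and ch.isdigit():
            out.append(ch)           # digits pass through unchanged
        # anything else, such as "-", is silently dropped
    return "".join(out)


print(normalize_codeset("UTF-8"))  # utf8
```

This shows how the IANA name UTF-8 ends up as the utf8 seen in locale names such as en_US.utf8.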
