My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
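To sketch what that line-ending conversion looks like in the same pipeline (the file name, its contents and the pattern here are made-up placeholders; a real UTF-16LE file with DOS line endings would be used instead):

```shell
# Fabricate a small UTF-16LE sample with CRLF line endings for demonstration.
printf 'hello pattern here\r\nother line\r\n' | iconv -f UTF-8 -t UTF-16LE > myfile.txt

# Convert to the local UTF-8, strip the carriage returns, then grep natively.
iconv -f UTF-16LE -t UTF-8 myfile.txt | tr -d '\r' | grep pattern
```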
I've seen that before. A bug. Try:
--- tr.c	6 Sep 2005 23:04:11 -0000	1.10
+++ tr.c	30 May 2014 09:46:33 -0000
@@ -291,7 +291,6 @@
 			if(c<ccnt) code[c] = d;
 			if(d<ccnt && sflag) squeez[d] = 1;
 		}
-		free(vect);
 		while((d = next(&string2)) != NIL) {
 			if(sflag) squeez[d] = 1;
 			if(string2.max==NIL && (string2.p==NULL || *string2.p==0))
(That was from a quick look a few months ago; while this patch should get you going, I can't guarantee it's right. Apply with patch -l.)
Now also note that /dev/urandom provides a stream of random bytes. In UTF-8, not all sequences of bytes map to valid characters. For instance, 0x41 0x81 0x41 is not valid because 0x81 is >= 0x80, and such a byte can only occur inside a multi-byte sequence of two or more bytes that are all >= 0x80. An invalid byte like that, because it's not in the set of characters that is the complement of ☠, will not be deleted by tr.
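To see that invalidity concretely, one can feed the 0x41 0x81 0x41 sequence from above to a strict UTF-8 decoder (iconv here, as a sketch; the exact error message depends on the iconv implementation):

```shell
# 0x41 0x81 0x41 as octal escapes; a strict UTF-8 decoder rejects the stray
# continuation byte 0x81 with an "illegal input sequence" style error.
printf '\101\201\101' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
  && echo 'valid UTF-8' || echo 'invalid UTF-8'
```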
Better would probably be:
recode ucs-2..u8 < /dev/urandom | tr -cd ☠
With ucs-2 being the characters U+0000 to U+FFFF encoded as 2 bytes per character, /dev/urandom looks more like a stream of ucs-2 characters (though we're missing the characters U+10000 to U+10FFFF).
But that would still include the D800..DFFF surrogate range, which mbrtowc(3) will choke on (at least with my version of libc). Those code points are reserved for the purposes of the UTF-16 encoding: D800 DC00, for instance, is the UTF-16BE encoding of U+10000, but there is no U+D800 or U+DC00 character. The UTF-8 encodings of those don't make sense as characters either (even when adjacent).
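As an illustration of that choking (using iconv as a stand-in for mbrtowc; a sketch, since the exact behavior is implementation-dependent), ED A0 80 is what the UTF-8 scheme would yield for the surrogate U+D800, and a conforming decoder refuses it:

```shell
# ED A0 80 (octal 355 240 200) is the would-be UTF-8 form of the surrogate
# U+D800; conforming UTF-8 decoders reject it as an illegal sequence.
printf '\355\240\200' | iconv -f UTF-8 -t UTF-16BE >/dev/null 2>&1 \
  && echo accepted || echo rejected
```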
So you'd need to exclude them first:
perl -ne 'BEGIN{$/=\2;binmode STDOUT,":utf8"}
$c = unpack("n",$_); if ($c < 0xd800 || $c > 0xdfff) {
no warnings "utf8"; print chr($c)
}' < /dev/urandom | tr -cd ☠
If the point is to get a stream of random Unicode characters encoded in UTF-8, best would probably be to pick a random code point in the allowable range (0..0xD7FF, 0xE000..0x10FFFF) and convert that to UTF-8. If you want to base it on /dev/urandom, you could use 3 bytes (24 bits) of it for each code point:
perl -ne 'BEGIN{$/=\3;binmode STDOUT,":utf8"}
$c = unpack("N","\0$_") * 0x10F800 >> 24;
$c+=0x800 if $c >= 0xd800;
do {no warnings "utf8"; print chr($c)}' < /dev/urandom |
tr -cd ☠
Best Answer
Short: setlocale, wcrtomb and wcsrtombs.