Which terminal encodings are default on Linux, and which are most common

character encoding

I need to make a decision regarding whether a complicated commercial program that I work on should assume a particular terminal encoding for Linux, or instead read it from the terminal (and if so, how).

It's pretty easy to guess which system and terminal encodings are most common on Windows. We can assume that most users configure these through the Control Panel, and that, for instance, their terminal encoding, which is usually non-Unicode, can be easily predicted from the standard configuration for that language/country. (For instance, on a US English machine, it will be OEM-437, while on a Russian machine, it will be OEM-866.)

But it's not clear to me how most users configure their system and terminal encodings on Linux. The savvy ones who often need to use non-ASCII characters probably use a UTF-8 encoding. But what proportion of Linux users fall into that category?

Nor is it clear which method most users use to configure their locale: changing the LANG environment variable, or something else.

A related question is how Linux configures these by default. My own Linux machine at work (actually a virtual Debian 5 machine that runs via VMware Player on my Windows machine) is set up by default to use a US-ASCII terminal encoding. However, I'm not sure whether that was set up by administrators at my workplace or whether it is the out-of-the-box setting.

Please understand that I'm not looking for answers to "Which encoding do you personally use?" but rather some means by which I could figure out the distribution of encodings that Linux users are likely to be using.

Best Answer

The oldest character encoding used in consoles like VT52 was ASCII.

That basic decision has been carried over for many years. Most consoles use ASCII, as defined by ANSI, as their most basic character set. The next encodings to appear (in the West) were the ISO-8859 sets (parts 1 through 16), one for each language group. The most common of these was ISO-8859-1 (Western European languages, including English), with the others used in proportion to the languages they cover.

Then, the most general list of world characters is Unicode, which, in Linux, is usually encoded in UTF-8.

UTF-8 is the most common encoding for present-day terminals and programs on Linux.

From more general to particular settings:

OS

The default in Debian since Etch (released April 8, 2007) has been UTF-8.

Note : Fresh Debian/Etch installation have UTF8 enabled by default.

This is confirmed in the release notes:

The default encoding for new Debian GNU/Linux installations is UTF-8. A number of applications will also be set up to use UTF-8 by default.

This means that Debian (and Ubuntu, Mint, and many others) are UTF-8 capable by default.

locale

Which encoding (and country) is actually chosen by the user with the command dpkg-reconfigure locales is left to user preferences.

That command configures the actual settings for the computer, which the locale command reports.

Each of the LC_* environment variables controls one specific category (section) of country/language behavior, as defined by the POSIX spec.
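To make the interplay of these variables concrete, here is a minimal Python sketch of the POSIX lookup order for the character-type category: LC_ALL wins if it is set and non-empty, then LC_CTYPE, then LANG. The function names effective_ctype_locale and codeset_of are hypothetical, and the final fallback to "C" is an assumption about typical implementation behavior when nothing is set.

```python
import os


def effective_ctype_locale(env=None):
    """Return the locale name that governs LC_CTYPE, using POSIX precedence."""
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty are treated the same way
            return value
    return "C"  # assumption: the usual implementation default


def codeset_of(locale_name):
    """Extract the codeset from language[_territory][.codeset][@modifier]."""
    base = locale_name.split("@", 1)[0]  # drop an optional @modifier
    if "." not in base:
        return None  # no explicit codeset, e.g. "C" or "en_US"
    return base.split(".", 1)[1]
```

For example, with LANG=en_US.UTF-8 and LC_CTYPE=ru_RU.KOI8-R in the environment, codeset_of(effective_ctype_locale()) yields "KOI8-R", because LC_CTYPE overrides LANG for character handling.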

tty

But the above are just "general" settings. A particular terminal may (or may not) match them. In general, though, the usual encoding for most terminals today is UTF-8.

Whether a particular terminal (tty) is set to UTF-8 can be checked with:

$ stty -a | grep -o '.iutf8'
 iutf8

That is, the flag is printed without a leading - (an output of -iutf8 would mean the flag is off).
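A program can read the same flag itself rather than parsing stty output. The following is a rough Python sketch using the standard termios module. The helper names are hypothetical, and the numeric fallback for IUTF8 is an assumption: it is the Linux value, used only if this Python build does not export termios.IUTF8.

```python
import os
import sys
import termios

# Assumption: 0o40000 is the Linux value of IUTF8; it is used only when
# this Python build does not export the constant itself.
IUTF8 = getattr(termios, "IUTF8", 0o40000)


def iflag_has_iutf8(iflag):
    """Report whether the IUTF8 bit is set in a termios input-flags word."""
    return bool(iflag & IUTF8)


def tty_is_utf8(fd=None):
    """Return True/False for a terminal, or None if fd is not a terminal."""
    if fd is None:
        fd = sys.stdin.fileno()
    if not os.isatty(fd):
        return None
    iflag = termios.tcgetattr(fd)[0]  # element 0 holds the input-mode flags
    return iflag_has_iutf8(iflag)
```

Note that this flag only tells the kernel's line discipline how to treat input (for example, how many bytes to erase on backspace). It does not by itself prove what the terminal emulator renders.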

terminal

But the terminal (GUI window) inside which the tty terminal is (usually) running also has its own locale setting. If the settings are sane, probably:

$ locale charmap
UTF-8

Will have the correct answer.
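For a program that wants to read the encoding rather than assume it, the equivalent of locale charmap can be sketched in Python as below. The function name terminal_codeset and the choice of UTF-8 as a fallback are assumptions made for illustration. locale.setlocale and locale.nl_langinfo are standard library calls on Unix-like systems.

```python
import locale


def terminal_codeset(default="UTF-8"):
    """Return the codeset named by the user's locale, or a fallback."""
    try:
        # Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG).
        # This changes process-wide state.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed here.
        return default
    return locale.nl_langinfo(locale.CODESET) or default
```

Calling terminal_codeset() once at startup and using the result to encode output is one way to honor the user's settings while still landing on UTF-8 when they are broken.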

But that is just a quick and very shallow look at all the i18n settings of Linux/Unix.

Takeaway: assuming that Linux is using UTF-8 is probably your best bet.

Related Solutions

Most common encoding for strings in C++ in Linux (and Unix?)

This is just a partial answer, since your question is fairly broad.

C++ defines an "execution character set" (in fact, two of them, a narrow and a wide one).

When your source file contains something like:

char s[] = "Hello";

Then the numeric byte values of the letters in the string literal are simply looked up according to the execution encoding. (The separate wide execution encoding applies to the numeric value assigned to wide character constants such as L'a'.)

All this happens as part of the initial reading of the source code file into the compilation process. Once inside, C++ characters are nothing more than bytes, with no attached semantics. (The type name char must be one of the most grievous misnomers in C-derived languages!)

There is a partial exception in C++11, where the literals u8"", u"" and U"" determine the resulting value of the string elements (i.e. the resulting values are globally unambiguous and platform-independent), but that does not affect how the input source code is interpreted.

A good compiler should allow you to specify the source code encoding, so even if your friend on an EBCDIC machine sends you her program text, that shouldn't be a problem. GCC offers the following options:

-finput-charset: input character set, i.e. how the source code file is encoded
-fexec-charset: execution character set, i.e. how to encode string literals
-fwide-exec-charset: wide execution character set, i.e. how to encode wide string literals

GCC uses iconv() for the conversions, so any encoding supported by iconv() can be used for those options.

I wrote previously about some opaque facilities provided by the C++ standard to handle text encodings.

Example: take the above code, char s[] = "Hello";. Suppose the source file is ASCII (i.e. the input encoding is ASCII). Then the compiler reads 99 and interprets it as c, and so on. When it comes to the literal, it reads 72 and interprets it as H. It then stores in the array the byte value of H as determined by the execution encoding (again 72 if that is ASCII or UTF-8). When you write \xFF, the compiler reads 92 120 70 70, decodes it as \xFF, and writes 255 into the array.
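The two lookups in this example can be imitated in a few lines of Python. This is only an illustration of the byte values involved, not of how any compiler is implemented. The codec cp037 is chosen here as one EBCDIC variant, purely as an assumption, to show an execution character set that differs from ASCII.

```python
# Bytes the compiler *reads* from an ASCII source file for the four
# characters \ x F F (a raw string keeps the backslash literal).
read_bytes = list(r"\xFF".encode("ascii"))

# Bytes *stored* for "Hello" under different execution character sets.
stored = {
    charset: list("Hello".encode(charset))
    for charset in ("ascii", "utf-8", "cp037")  # cp037: an EBCDIC variant
}

# The escape sequence itself denotes a single byte.
escaped = list(b"\xFF")

print(read_bytes)       # [92, 120, 70, 70]
print(stored["ascii"])  # [72, 101, 108, 108, 111]
print(escaped)          # [255]
```

The ASCII and UTF-8 entries in stored are identical, while the cp037 entry differs, which is the sense in which the same literal produces different bytes under a different execution encoding.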

SSH – Working with Filenames in Different Encoding

Inside a terminal emulator that supports UTF-8, you can use the luit command to run a subshell (or other program) in a different locale. The locale setting that indicates character sets is LC_CTYPE.

LC_CTYPE=ru_RU.KOI8-R luit ls   # run one command
LC_CTYPE=ru_RU.KOI8-R luit      # start a shell (type Ctrl+D or exit to return to the parent shell)

If you have a whole tree of files in a different encoding, I recommend (if possible) mounting it through convmvfs.

mkdir ~/net/ivan@example.com.KOI8-R ~/net/ivan@example.com.UTF-8
sshfs ivan@example.com: ~/net/ivan@example.com.KOI8-R
convmvfs -o srcdir=~/net/ivan@example.com.KOI8-R,icharset=KOI8-R,ocharset=UTF-8 ~/net/ivan@example.com.UTF-8
ls ~/net/ivan@example.com.UTF-8
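To see why such a conversion layer is needed, the sketch below shows in Python what happens to a KOI8-R file name when its bytes are read as UTF-8, and the re-encoding that fixes it. The file name and the helper reencode_name are hypothetical. This is the conceptual transformation only and says nothing about how luit or convmvfs are implemented.

```python
name = "файл.txt"            # hypothetical file name
raw = name.encode("koi8-r")  # the bytes as stored on the KOI8-R side

# What a UTF-8 consumer makes of those bytes: invalid sequences,
# replaced here with U+FFFD so the damage is visible.
misread = raw.decode("utf-8", errors="replace")

# Decoding with the correct source charset recovers the name.
recovered = raw.decode("koi8-r")


def reencode_name(raw_name, src="koi8-r", dst="utf-8"):
    """Convert file-name bytes from one charset to another."""
    return raw_name.decode(src).encode(dst)


print(repr(misread))
print(recovered)
```

The ASCII part of the name (.txt) survives either way, since KOI8-R and UTF-8 agree on the ASCII range. Only the Cyrillic bytes need converting.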