Which terminal encodings are the default on Linux, and which are most common?

character encoding

I need to make a decision regarding whether a complicated commercial program that I work on should assume a particular terminal encoding for Linux, or instead read it from the terminal (and if so, how).

It's pretty easy to guess which system and terminal encodings are most common on Windows. We can assume that most users configure these through the Control Panel, and that, for instance, their terminal encoding, which is usually non-Unicode, can be easily predicted from the standard configuration for that language/country. (For instance, on a US English machine, it will be OEM-437, while on a Russian machine, it will be OEM-866.)

But it's not clear to me how most users configure their system and terminal encodings on Linux. The savvy ones who often need to use non-ASCII characters probably use a UTF-8 encoding. But what proportion of Linux users fall into that category?

Nor is it clear which method most users use to configure their locale: changing the LANG environment variable, or something else.

A related question would be how Linux configures these by default. My own Linux machine at work (actually a virtual Debian 5 machine that runs via VMware Player on my Windows machine) is set up by default to use a US-ASCII terminal encoding. However, I'm not sure whether that was set up by administrators at my workplace or whether that's the out-of-the-box setting.

Please understand that I'm not looking for answers to "Which encoding do you personally use?" but rather some means by which I could figure out the distribution of encodings that Linux users are likely to be using.

Best Answer

The oldest character encoding used in consoles like VT52 was ASCII.

That basic decision has been carried forward for many years: most consoles still treat ASCII, as standardized by ANSI, as their baseline character set. The next generation of encodings (in the West) was the ISO-8859 family (parts 1 through 16), roughly one per language group. ISO-8859-1 (Latin-1, covering Western European languages including English) was the most common, with the others used in proportion to the languages they serve.

Then, the most general list of world characters is Unicode, which, in Linux, is usually encoded in UTF-8.

UTF-8 is the most common encoding for present-day terminals and programs on Linux.


From the most general to the most specific settings:

OS

The default in Debian has been UTF-8 since Etch (Debian 4.0), released on April 8th, 2007 (13 years ago as of this writing).

Note : Fresh Debian/Etch installation have UTF8 enabled by default.

This is confirmed in the release notes:

The default encoding for new Debian GNU/Linux installations is UTF-8. A number of applications will also be set up to use UTF-8 by default.

What that means is that Debian (and Ubuntu, Mint, and many others) are UTF-8 capable by default.

locale

Which locale (language, country, and encoding) is actually used is left to the user's preference; on Debian it is chosen with the command dpkg-reconfigure locales.

That configures the settings that the locale command reports.

Each of the LC_* environment variables controls one locale category (for example, LC_CTYPE for character classification and encoding, LC_TIME for date and time formats), as defined by the POSIX spec.
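
To make the interaction between these variables concrete, here is a minimal sketch of the POSIX lookup order for the character-encoding category: LC_ALL wins, then LC_CTYPE, then LANG, and an empty value counts as unset. The function names effective_ctype and codeset_of are made up for illustration, and the final fallback to "C" is an assumption that matches glibc (POSIX leaves that default to the implementation).

```shell
# Print the locale name that governs character encoding (the LC_CTYPE category).
# Precedence per POSIX: LC_ALL, then LC_CTYPE, then LANG; empty means unset.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "C"   # assumption: the implementation default is "C"
    fi
}

# Extract the codeset from a locale name such as en_US.UTF-8 or
# de_DE.ISO-8859-1@euro. Prints an empty line if the name carries no codeset.
codeset_of() {
    case "$1" in
        *.*)
            _cs=${1#*.}                  # drop everything up to the first dot
            printf '%s\n' "${_cs%%@*}"   # drop an optional @modifier
            ;;
        *)
            printf '\n'
            ;;
    esac
}
```

For example, codeset_of "$(effective_ctype)" prints UTF-8 when LANG=en_US.UTF-8 and nothing overrides it. A name without an explicit codeset (such as ru_RU) still has a system-defined encoding, so parsing the name is only a hint; locale charmap is the more reliable source.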

tty

But the above are just "general" settings, and a particular terminal may (or may not) match them. That said, the usual encoding for most terminals today is UTF-8.

Whether a particular terminal (tty) is set to UTF-8 input mode can be checked with:

$ stty -a | grep -o '.iutf8'
 iutf8

That is, the printed result has no - in front of it. An output of -iutf8 would mean the flag is switched off.
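
If a script needs to act on that flag rather than have a human read it, the same check can be wrapped in a small function. This is only a sketch: iutf8_state is a hypothetical helper name, and it takes the stty output as text so it can be exercised without a real terminal. Keep in mind that iutf8 is a Linux tty-driver flag; it says how the kernel handles input editing, not what the terminal emulator actually renders.

```shell
# Interpret the output of `stty -a` passed as a single argument.
# Prints "on" if iutf8 is set, "off" if -iutf8 is present, "unknown" otherwise.
iutf8_state() {
    # Put every space- or tab-separated token on its own line.
    _flags=$(printf '%s\n' "$1" | tr -s ' \t' '\n\n')
    if printf '%s\n' "$_flags" | grep -qx -- '-iutf8'; then
        echo off
    elif printf '%s\n' "$_flags" | grep -qx 'iutf8'; then
        echo on
    else
        echo unknown
    fi
}
```

On a real terminal it would be called as iutf8_state "$(stty -a 2>/dev/null)". When standard input is not a tty, stty fails, the argument is empty, and the function prints unknown.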

terminal

But the terminal (GUI window) inside which the tty is (usually) running also has its own locale setting. If the settings are sane, this will probably give the correct answer:

$ locale charmap
UTF-8
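
For a program that wants to read the encoding instead of assuming it, the output of locale charmap can be sorted into a few coarse cases. The sketch below uses a hypothetical helper, classify_charmap, and its labels (utf8, ascii, legacy, unknown) are my own. The one non-obvious name is ANSI_X3.4-1968, which is what glibc prints for plain ASCII in the C/POSIX locale.

```shell
# Map the name printed by `locale charmap` to a coarse label.
classify_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8)
            echo utf8 ;;
        ANSI_X3.4-1968|US-ASCII|ASCII)
            echo ascii ;;     # glibc reports ANSI_X3.4-1968 for the C locale
        '')
            echo unknown ;;   # locale(1) missing or printed nothing
        *)
            echo legacy ;;    # e.g. ISO-8859-x, KOI8-R, EUC-JP
    esac
}
```

Typical use would be classify_charmap "$(locale charmap 2>/dev/null)". Treating the unknown and ascii cases as UTF-8 is a reasonable policy on current Linux systems, since ASCII text is valid UTF-8, but that choice is an assumption of this sketch rather than something the system reports.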

But that is just a quick and very shallow look at all the i18n settings of Linux/Unix.

Takeaway: assuming that Linux is using UTF-8 is probably your best bet.