All the locale variables use the same locale name so that you can specify your favorite locale in a single swoop, e.g. `LANG=en_AU.utf8`. As you surmise, the country information is occasionally relevant even in `LC_CTYPE`, e.g. the uppercase version of `i` is `I` in most languages but `İ` in Turkish (`tr_TR.utf8`). But don't expect miracles; for example, the lowercase-uppercase correspondence is one-to-one, so there's no good uppercase version of `ß` in `de_DE.iso8859-1` (it should be `SS`).
You'll have an easier time understanding the output of `locale -k LC_CTYPE`, with `-k` to see the keyword names in addition to the values (without `-k`, the output format is designed so you can get the value of a specific keyword, e.g. `locale ctype-width`). The list of keywords and their meanings is system-dependent, as is the way locale data is stored, and doesn't interest many people, so you may not find much documentation outside the source code of your C library. By far the most useful form of the locale command is `locale -a` to list available locale names.
For GNU libc (i.e. non-embedded Linux):
- All locale data other than messages is stored in `/usr/lib/locale/locale-archive`. This file is generated by `localedef` from data in `/usr/share/i18n` and `/usr/local/share/i18n`. The format of the locale definition files in `/usr/share/i18n/locales` is, I think, only documented in the source code.
- The format of the character set and encoding definition files in `/usr/share/i18n/charmaps` is standardized by POSIX:2001. These files (or, in GNU libc, the compiled version in `/usr/lib/locale/locale-archive`) are used by the iconv programming and command-line facility. Encoding conversions also rely on code in `/usr/lib/gconv/*.so`. The GNU libc manual documents how to write your own gconv module, though that section contains the text “This information should be sufficient to write new modules. Anybody doing so should also take a look at the available source code in the GNU C library sources.”
- Message catalogs get special treatment because each application comes with its own set. Message catalogs live in `/usr/share/locale/*/LC_MESSAGES`. The manual contains documentation for application writers. GNU libc supports both the POSIX interface `catgets` and the more powerful gettext interface.
Written languages are indeed very complicated, even if you don't stray far from English. Are the French and German `ü` the same character (is a “tréma” exactly the same as an “umlaut”, and does it matter that French and German printers typeset the accent at a slightly different height)? What is the uppercase of `i` (it's `İ` in Turkish)? Does `Ö` transliterate to `O` if you only have ASCII (in German, it's `OE`)? Where is `Ä` sorted in a dictionary (in Swedish, it's after `Z`)? And that's just a few examples with European languages written in the Latin alphabet! The Unicode mailing list has a lot of examples and sometimes heated discussions on such topics.
This is just a partial answer, since your question is fairly broad.
C++ defines an "execution character set" (in fact, two of them, a narrow and a wide one).
When your source file contains something like:
char s[] = "Hello";
Then the numeric byte values of the letters in the string literal are simply looked up according to the execution encoding. (The separate wide execution encoding applies to the numeric value assigned to wide character constants like `L'a'`.)
All this happens as part of the initial reading of the source code file into the compilation process. Once inside, C++ characters are nothing more than bytes, with no attached semantics. (The type name `char` must be one of the most grievous misnomers in C-derived languages!)
There is a partial exception in C++11, where the literals `u8""`, `u""` and `U""` determine the resulting values of the string elements (i.e. the resulting values are globally unambiguous and platform-independent), but that does not affect how the input source code is interpreted.
A good compiler should allow you to specify the source code encoding, so even if your friend on an EBCDIC machine sends you her program text, that shouldn't be a problem. GCC offers the following options:
- `-finput-charset`: input character set, i.e. how the source code file is encoded
- `-fexec-charset`: execution character set, i.e. how to encode string literals
- `-fwide-exec-charset`: wide execution character set, i.e. how to encode wide string literals
GCC uses `iconv()` for the conversions, so any encoding supported by `iconv()` can be used for those options.
I wrote previously about some opaque facilities provided by the C++ standard to handle text encodings.
Example: take the above code, `char s[] = "Hello";`. Suppose the source file is ASCII (i.e. the input encoding is ASCII). Then the compiler reads `99`, and interprets it as `c`, and so on. When it comes to the literal, it reads `72`, interprets it as `H`. Now it stores the byte value of `H` in the array, which is determined by the execution encoding (again `72` if that is ASCII or UTF-8). When you write `\xFF`, the compiler reads `92 120 70 70`, decodes it as `\xFF`, and writes `255` into the array.
The C locale is not the default locale. It is a locale that is guaranteed not to cause any “surprising” behavior. A number of commands have output of a guaranteed form (e.g. `ps` or `df` headers, the `date` format) in the `C` or `POSIX` locale. For encodings (`LC_CTYPE`), it is guaranteed that `[:alpha:]` contains only the ASCII letters, and so on. If the `C` locale were modified, this would cause many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.

If you want all programs on your system to use UTF-8, set the default locale to UTF-8. All programs that manipulate a single encoding, that is. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).
locale was modified, this would call many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.If you want all programs on your system to use UTF-8, set the default locale to UTF-8. All programs that manipulate a single encoding, that is. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).