Character Encoding – Impact of C Locale Being UTF-8 Instead of ASCII

character-encoding, compatibility, locale, posix, unicode

The C locale is defined to use the ASCII charset and POSIX does not provide a way to use a charset without changing the locale as well.

What would happen if the encoding of the C locale were switched to UTF-8 instead?

The positive side would be that UTF-8 would become the default charset for any process, even system daemons. Obviously, some applications would break because they assume that C uses 7-bit ASCII. But do such applications really exist? A lot of code written today is locale- and charset-aware to some extent; I would be surprised to see code that can only deal with 7-bit-clean input and cannot easily be adapted to accept a UTF-8-enabled C locale.

Best Answer

The C locale is not the default locale. It is a locale that is guaranteed not to cause any “surprising” behavior. A number of commands have output of a guaranteed form (e.g. ps or df headers, date format) in the C or POSIX locale. For encodings (LC_CTYPE), it is guaranteed that [:alpha:] only contains the ASCII letters, and so on. If the C locale were modified, this would cause many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.
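To make that failure mode concrete, here is a minimal sketch in Python rather than in terms of the C library itself. It contrasts byte-oriented processing, roughly what a program gets from an ASCII-only C locale, with processing that insists on valid UTF-8. The function names and the sample string are made up for illustration.

```python
def count_ascii_letters(data: bytes) -> int:
    """Byte-oriented: only A-Z and a-z count as letters; other bytes are opaque."""
    return sum(1 for b in data if 65 <= b <= 90 or 97 <= b <= 122)


def count_letters_strict_utf8(data: bytes) -> int:
    """Decode as UTF-8 first; raises UnicodeDecodeError on invalid input."""
    return sum(1 for ch in data.decode("utf-8") if ch.isalpha())


# Hypothetical sample: a Latin-1 encoded name, which is not valid UTF-8.
latin1_name = "café".encode("latin-1")  # b'caf\xe9'

print(count_ascii_letters(latin1_name))  # the stray 0xE9 byte is simply ignored

try:
    count_letters_strict_utf8(latin1_name)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The byte-oriented version never fails, whatever the input contains. The strict version gives a richer answer for valid UTF-8 but refuses data that worked before, which is the kind of behavior change described above.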

If you want all programs on your system to use UTF-8, set the default locale to a UTF-8 locale. More precisely, this affects all programs that manipulate a single encoding. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).
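As a rough illustration of how a program ends up with its default locale, the sketch below models the POSIX precedence of the environment variables for the LC_CTYPE category: LC_ALL first, then LC_CTYPE, then LANG, skipping unset or empty values. The function name and the locale names in the usage lines are hypothetical, and the final fallback is implementation-defined in POSIX; returning "C" here is an assumption that matches common systems.

```python
def effective_ctype(env: dict) -> str:
    """Return the locale name that would govern LC_CTYPE for this environment."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # assumption: the implementation-defined default is C/POSIX


# A system-wide LANG makes a UTF-8 locale the default...
print(effective_ctype({"LANG": "en_US.UTF-8"}))

# ...while a script that needs the guaranteed behavior can still ask for C.
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))
```

This is why changing the default locale is the less disruptive option: programs that rely on the C locale's guarantees can keep requesting it explicitly.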