CentOS – the proper encoding name to use in locale for UTF-8

centos gdm locale x11

I'm wondering about this because on this CentOS 7 system I have:

$ locale -a 
<snip>
en_US.utf8
<snip>

and yet:

$ localectl 
System Locale: LANG=en_US.UTF-8

To add to that, the preferred name according to X11 (/usr/share/X11/locale/locale.dir) is:

$ grep 'en_US.UTF-8$' /usr/share/X11/locale/locale.dir 
en_US.UTF-8/XLC_LOCALE                  en_US.UTF-8
en_US.UTF-8/XLC_LOCALE:                 en_US.UTF-8

Luckily for en_US.utf8, there is an alias:

$ grep 'en_US.utf8' /usr/share/X11/locale/locale.alias
en_US.utf8                                      en_US.UTF-8
en_US.utf8:                                     en_US.UTF-8

Some others aren't so lucky, e.g. ru_UA.utf8:

$ locale -a | grep ru_UA.utf8
ru_UA.utf8
$ grep 'ru_UA.utf8' /usr/share/X11/locale/locale.alias
$ grep 'ru_UA.UTF-8' /usr/share/X11/locale/locale.dir
en_US.UTF-8/XLC_LOCALE                  ru_UA.UTF-8
en_US.UTF-8/XLC_LOCALE:                 ru_UA.UTF-8

This is somewhat annoying when the selected locale is not in the X11 locale.alias, because GDM (or gnome-session?) forces the use of the "utf8" version, which breaks X programs with messages like: "Warning: locale not supported by Xlib, locale set to C". I could just edit /usr/share/X11/locale/locale.alias, but it would be nice to have more info on which version is actually right.
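
To see which locales would run into this, one option is to compare the names printed by locale -a against the first column of locale.alias. The following is only a minimal sketch: the function name missing_x11_aliases is made up for illustration, and it assumes the alias file has the two-column layout shown in the grep output above (the alias, optionally with a trailing colon, followed by the X11 name).

```shell
# Hypothetical helper: reads locale names on stdin and prints those
# that have no entry in the given X11 alias file.
missing_x11_aliases() {
    alias_file=$1
    while IFS= read -r name; do
        # An alias line starts with the alias itself, optionally followed by ":".
        if ! awk -v n="$name" '
                $1 == n || $1 == (n ":") { found = 1 }
                END { exit !found }' "$alias_file"
        then
            printf '%s\n' "$name"
        fi
    done
}

# Example use (path as on the CentOS 7 system above):
# locale -a | grep '\.utf8$' | missing_x11_aliases /usr/share/X11/locale/locale.alias
```

Each name it prints is a candidate for the "locale not supported by Xlib" warning, assuming the session really does export the "utf8" spelling.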

Best Answer

Comments in GNU libc sources (intl/l10nflist.c:_nl_normalize_codeset) state:

There is no standard for the codeset names.

Codeset names are normalized by that function to all-lowercase with all non-alphanumeric characters stripped, i.e. "UTF-8" turns into "utf8".
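
The effect of that normalization can be imitated in the shell. This is a rough sketch of the behavior described here, not a port of the glibc code: the function name normalize_codeset is reused only as a label, and the real function may handle cases this one does not.

```shell
# Sketch of the normalization described above: keep only ASCII letters
# and digits, and lowercase the letters.
normalize_codeset() {
    printf '%s' "$1" \
        | LC_ALL=C tr -cd '[:alnum:]' \
        | LC_ALL=C tr '[:upper:]' '[:lower:]'
    echo
}

# Example use:
# normalize_codeset UTF-8
```

Under this rule "UTF-8", "utf-8" and "utf8" all collapse to the same string, which is why the different spellings select the same locale in libc.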

The locale names inside the locale archive are using normalized codeset names.

Since there is no standard, GDM is well within its rights to use "utf8" and locales like 'ru_UA.utf8' are not invalid. "utf8" may not be preferred, but it is definitely acceptable (at least by libc standards) as it is the normalized form.

Related Solutions

Character Encoding – Impact of C Locale Being UTF-8 Instead of ASCII

The C locale is not the default locale. It is a locale that is guaranteed not to cause any “surprising” behavior. A number of commands have output of a guaranteed form (e.g. ps or df headers, date format) in the C or POSIX locale. For encodings (LC_CTYPE), it is guaranteed that [:alpha:] only contains the ASCII letters, and so on. If the C locale were modified, this would cause many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.
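
The [:alpha:] guarantee can be observed with tr. The snippet below is a small sketch; the helper name c_locale_alpha_count is made up for illustration. It counts how many input bytes the C locale classifies as alphabetic, so the bytes of a non-ASCII letter encoded in UTF-8 should not be counted.

```shell
# Hypothetical helper: count the bytes on stdin that the C locale
# classifies as [:alpha:].
c_locale_alpha_count() {
    LC_ALL=C tr -cd '[:alpha:]' | wc -c | tr -d ' '
}

# Example use: "a" followed by the two UTF-8 bytes of "é" (octal 303 251).
# printf 'a\303\251' | c_locale_alpha_count
```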

If you want all programs on your system to use UTF-8, set the default locale to UTF-8. All programs that manipulate a single encoding, that is. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).

Bash – How to Determine if Current Locale Uses UTF-8 Encoding

From Wikipedia:

On POSIX platforms, locale identifiers are defined similarly to the BCP 47 definition of language tags, but the locale variant modifier is defined differently, and the character encoding is included as a part of the identifier.

It is defined in this format: [language[_territory][.codeset][@modifier]]. (For example, Australian English using the UTF-8 encoding is en_AU.UTF-8.)

However, if the codeset suffix is missing in the locale identifier, for example as in en_AG (see this question), then the codeset is defined by a default setting for that locale, which could very well be UTF-8. As a result, the current encoding cannot be determined by looking at the LANG environment variable.
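
The identifier format quoted above can be taken apart with plain parameter expansion. This is a sketch under the assumption that the identifier follows that format exactly; the helper name locale_codeset is made up. It prints the codeset part when one is present and fails otherwise, which is exactly the case where LANG alone does not settle the question.

```shell
# Hypothetical helper: print the codeset part of a locale identifier,
# or return non-zero when the identifier has none.
locale_codeset() {
    name=${1%%@*}              # drop an optional @modifier
    case $name in
        *.*) printf '%s\n' "${name#*.}" ;;
        *)   return 1 ;;
    esac
}

# Example use:
# locale_codeset "$LANG" || echo "no codeset in LANG, the locale's default applies"
```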

Further, the locale command, when run without arguments, only shows the current values of the locale environment variables, so that output cannot be used to determine the codeset either.

However, the Perl module I18N::Langinfo (see also this question) seems to be a solution:

perl -MI18N::Langinfo=langinfo,CODESET -E 'say "Uses UTF-8 encoding .." if langinfo(CODESET()) eq "UTF-8"'

This Perl module is a wrapper for the C library function nl_langinfo.
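
For a shell-only check, the locale utility can also be asked for the charmap keyword, which reports the codeset name of the current locale. The following is a sketch rather than a definitive test: the helper names are made up, and comparing against a normalized "utf8" is an assumption made here to tolerate spelling variants of the name.

```shell
# Hypothetical helper: succeed if the given codeset name denotes UTF-8,
# ignoring case and punctuation ("UTF-8", "utf8", ...).
is_utf8_codeset() {
    normalized=$(printf '%s' "$1" \
        | LC_ALL=C tr -cd '[:alnum:]' \
        | LC_ALL=C tr '[:upper:]' '[:lower:]')
    [ "$normalized" = "utf8" ]
}

# Ask the locale utility for the codeset of the current locale.
current_locale_is_utf8() {
    is_utf8_codeset "$(locale charmap 2>/dev/null)"
}

# Example use:
# current_locale_is_utf8 && echo "Uses UTF-8 encoding .."
```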
