The relevance of ‘en_AU’ in ‘LC_CTYPE’? and what is `locale LC_CTYPE` output all about

locale

First off: although I can understand the relevance of geographic-region detail for LC_TIME, LC_NUMERIC, and most other LC_* variables, I don't quite see how 'en_AU' relates to LC_CTYPE.
Isn't UTF-8 (or any other encoding) enough of a definition in itself for LC_CTYPE, since an encoding is by definition consistent?

Thinking about it as I write, it may be that different regions of the world capitalize their lower-case letters differently. If this is the case, how does iconv handle it?
This iconv point is actually what started me on this line of thought, because iconv doesn't ask for a locale; it only asks for the input encoding format.

My next puzzle is: what do the line items in the output of locale LC_CTYPE refer to, and where is a good place to find a description of the layout? Perhaps a more relevant question is: by whom, and where, would this info be needed?
I'm pretty sure I don't need it, but it all helps to fill in the picture of 'scripts', 'encodings' and 'locales', which is surprisingly non-trivial as soon as you leave the world of ASCII.

Best Answer

All the locale variables use the same locale name so that you can specify your favorite locale in a single swoop, e.g. LANG=en_AU.utf8. As you surmise, the country information is occasionally relevant even in LC_CTYPE, e.g. the uppercase version of i is I in most languages but İ in Turkish (tr_TR.utf8). But don't expect miracles; for example the lowercase-uppercase correspondence is one-to-one, so there's no good uppercase version of ß in de_DE.iso8859-1 (it should be SS).
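As a rough illustration of how the variables interact (a sketch, not part of the original answer): LC_ALL overrides the individual LC_* variables, which in turn override LANG. The helper name effective_ctype is made up for this example, and only the C and POSIX locales are used because they exist on every system.

```shell
# Hypothetical helper: print the LC_CTYPE value that `locale` reports,
# with the quotes (which mark implied values) stripped.
effective_ctype() {
    locale 2>/dev/null | sed -n 's/^LC_CTYPE=//p' | tr -d '"'
}

# Only LANG is set: LC_CTYPE is inherited from it.
from_lang=$(unset LC_ALL LC_CTYPE; export LANG=C; effective_ctype)

# LC_CTYPE is set as well: it wins over LANG for this category.
from_ctype=$(unset LC_ALL; export LANG=C LC_CTYPE=POSIX; effective_ctype)

# LC_ALL is set: it overrides everything else.
from_all=$(export LANG=POSIX LC_CTYPE=POSIX LC_ALL=C; effective_ctype)

printf 'LANG only: %s\nLANG + LC_CTYPE: %s\nwith LC_ALL: %s\n' \
    "$from_lang" "$from_ctype" "$from_all"
```

Each command substitution runs in its own subshell, so none of these settings leak into the calling shell.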

You'll have an easier time understanding the output of locale -k LC_CTYPE, with -k to see the keyword names in addition to the values (without -k, the output format is designed so you can get the value of a specific keyword, e.g. locale ctype-width). The list of keywords and their meanings is system-dependent, as is the way locale data is stored, and doesn't interest many people, so you may not find much documentation outside the source code of your C library. By far the most useful form of the locale command is locale -a to list available locale names.
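The difference between the two output styles can be seen with a single keyword. This is a small sketch that assumes a POSIX-style locale command; charmap is a standard keyword, while the variable names are only illustrative. The exact charmap name reported for the C locale differs between systems, so no particular value is assumed here.

```shell
# Value only: convenient for use in scripts.
charmap_value=$(LC_ALL=C locale charmap)

# Keyword and value together, printed as keyword="value".
charmap_line=$(LC_ALL=C locale -k charmap)

# Number of locale names the system knows about.
locale_count=$(locale -a 2>/dev/null | wc -l)

printf 'value:   %s\nline:    %s\nlocales: %s\n' \
    "$charmap_value" "$charmap_line" "$locale_count"
```

Passing a category name instead of a keyword (for example locale -k LC_CTYPE) prints every keyword in that category in the same keyword=value form.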

For GNU libc (i.e. non-embedded Linux):

All locale data other than messages is stored in /usr/lib/locale/locale-archive. This file is generated by localedef from data in /usr/share/i18n and /usr/local/share/i18n. The format of the locale definition files in /usr/share/i18n/locales is only documented in the source code, I think.
The format of the character set and encoding definition files in /usr/share/i18n/charmaps is standardized by POSIX:2001. These files (or, in GNU libc, the compiled version in /usr/lib/locale/locale-archive) are used by the iconv programming interface and command-line utility. Encoding conversions also rely on code in /usr/lib/gconv/*.so. The GNU libc manual documents how to write your own gconv module, though that section contains the text “This information should be sufficient to write new modules. Anybody doing so should also take a look at the available source code in the GNU C library sources.”
Message catalogs get special treatment because each application comes with its own set. Message catalogs live in /usr/share/locale/*/LC_MESSAGES. The manual contains documentation for application writers. GNU libc supports both the POSIX interface catgets and the more powerful gettext interface.
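On the iconv part of the question: a conversion between two encodings is a byte-level mapping, which is why iconv asks only for the source and target encodings. Here is a small sketch, assuming an iconv that knows the names UTF-8 and ISO-8859-1 (glibc's does); the variable names are illustrative.

```shell
# "ü" in UTF-8 is the two bytes 0xC3 0xBC (octal \303\274).
# Convert to ISO-8859-1 and dump the result as hex.
hex_default=$(printf '\303\274' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

# Same conversion with the locale forced to C: the bytes do not change.
hex_c=$(printf '\303\274' | LC_ALL=C iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "default locale: $hex_default, C locale: $hex_c"
```

Both runs produce the single byte fc. As far as I know, the locale only comes into play for extras such as glibc's //TRANSLIT suffix, whose substitutions can vary with LC_CTYPE.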

Written languages are indeed very complicated, even if you don't stray far from English. Are the French and German ü the same character (is a “tréma” exactly the same as an “umlaut”, and does it matter that French and German printers typeset the accent at a slightly different height)? What is the uppercase of i (it's İ in Turkish)? Does Ö transliterate to O if you only have ASCII (in German, it's OE)? Where is Ä sorted in a dictionary (in Swedish, it's after Z)? And those are just a few examples from European languages written in the Latin alphabet! The Unicode mailing list has a lot of examples and sometimes heated discussions on such topics.
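The sorting question is easy to see from the shell. The sketch below assumes only a POSIX sort. In the C locale, comparison is by raw byte value, so the UTF-8 encoding of Ä (bytes 0xC3 0x84, written as octal escapes here) lands after every ASCII letter. A language-specific locale such as sv_SE or de_DE, if installed, may order the same input differently.

```shell
# Ä is written as the octal escapes \303\204 so the script itself stays ASCII.
sorted_c=$(printf 'zebra\n\303\204pfel\napple\n' | LC_ALL=C sort)

# In the C locale this prints: apple, zebra, then the word starting with Ä.
printf '%s\n' "$sorted_c"
```

Replacing LC_ALL=C with an installed language locale is a quick way to compare collation rules on a given machine.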

Related Solutions

Bash – LC_CTYPE breaking autocomplete: what is the cause of this problem

I have a hunch that something in your bash_completion is causing this to happen. Try clearing out your bash completion temporarily (until you exit the shell) by running:

complete -r

If that clears it up, then the problem is something in bash completion; if not, it might still be one of the bash built-ins.

Arch Linux – Why Is Almost Every Program Complaining About Locale?

You're missing a file which would be used to default the locale in the absence of $LANG or $LC_ALL (or all of the more specific $LC_whatever) being set.

On older glibc, it's /usr/lib/locale/locale-archive. Because GNU/Linux is chaotic, you should use strace to determine which files are expected in the particular versions in use on your machine:

strace -e file locale
execve("/usr/bin/locale", ["locale"], [/* 36 vars */]) = 0
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/libc.so.6", O_RDONLY)        = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 3

----------------------Comments added 1 day later:

ltrace -S should be okay, since it shows syscalls.

Otherwise, "ltrace" is not very helpful (i.e. it's counterproductive versus strace), because it only shows the uppermost calls. Those are obvious (setlocale(3)), whereas the real problem happens within libc.

It sounds like you have the raw locale data installed, since en_US.UTF-8 works.

If so, then something like this should fix your problem by compiling that locale into the system-wide locale archive:

localedef -f UTF-8 -i en_US en_US.UTF-8
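Before or after running localedef, it can help to check whether the locale name is actually known to the system. This is a minimal sketch, assuming a locale -a that lists one name per line. have_locale is a hypothetical helper, and its normalization covers the common case where locale -a prints en_US.utf8 for what users write as en_US.UTF-8.

```shell
# Hypothetical helper: succeed if the given locale name appears in `locale -a`.
# Both sides are lowercased and stripped of '-' so that "en_US.UTF-8"
# matches "en_US.utf8".
have_locale() {
    want=$(printf '%s' "$1" | LC_ALL=C tr 'A-Z' 'a-z' | tr -d '-')
    locale -a 2>/dev/null | LC_ALL=C tr 'A-Z' 'a-z' | tr -d '-' \
        | grep -xF -- "$want" >/dev/null
}

if have_locale en_US.UTF-8; then
    echo "en_US.UTF-8 is available"
else
    echo "en_US.UTF-8 is missing; it may need to be generated"
fi
```

The comparison is deliberately loose; it does not try to resolve locale aliases or modifiers.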
