Debian – Difference between locale-archive and Machine Object files in /usr/share/locale/<locale_dir>/LC_MESSAGES/ directory

As I understand it, the locale-gen utility generates the /usr/lib/locale/locale-archive database from the entries in /etc/locale.gen and the locale definition files in /usr/share/i18n/locales/. In addition, programs store their translations in Machine Object format under the /usr/share/locale/<locale_dir>/LC_MESSAGES/ directory. For example:

# dpkg -L wget | grep nl
/usr/share/locale/nl
/usr/share/locale/nl/LC_MESSAGES
/usr/share/locale/nl/LC_MESSAGES/wget.mo
#

When I run, for example, strace -e open wget, I can see that both /usr/lib/locale/locale-archive and /usr/share/locale/nl/LC_MESSAGES/wget.mo are opened.

What localization data is stored in the files under /usr/share/locale/<locale_dir>/LC_MESSAGES/, and what localization data is stored in /usr/lib/locale/locale-archive?

Best Answer

While knowing nearly nothing upfront about how localization is implemented in Linux, I tried my best to get my head around it.

Brief Description

/usr/lib/locale/locale-archive

locale-archive is a memory-mapped file generated by locale-gen(8), which invokes localedef(1). Memory-mapped means that a program does not read the whole file up front: it maps the file into its address space, and the kernel loads pages on demand and shares them between all processes that map the same file.
Since all locales enabled in /etc/locale.gen are compiled ahead of time and the archive rarely changes, there is no need to keep several copies of it in memory. Every program that uses the archive is pointed at the pages already in memory, so the mapping mostly adds to the program's virtual memory. This lowers the physical memory footprint of each process and also speeds up locale loading, because no additional disk I/O is needed once the pages are cached.
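
To make the mapping idea concrete, here is a minimal Python sketch of mapping a file read-only. It is only an illustration and not how glibc itself loads the archive; ARCHIVE is the Debian path from the question, and map_read_only is a name made up for this example.

```python
import mmap
import os

# Debian's default location, taken from the question; other systems may differ.
ARCHIVE = "/usr/lib/locale/locale-archive"


def map_read_only(path):
    """Map a file read-only and return (file size, first four bytes)."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # Length 0 maps the whole file. Nothing is copied up front: the
        # kernel faults pages in on demand when they are first touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return size, m[:4]
```

On a system where the archive exists, map_read_only(ARCHIVE) returns its size and leading bytes while touching only the first page of the file.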

Also, it seems to work as a sort of fallback locale file containing all system-wide locales. In addition, the archive is heavily used by software linked against glibc.
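
As for what kind of data the archive holds: as far as I can tell, it is the compiled locale definitions, i.e. category data such as LC_CTYPE, LC_COLLATE, LC_NUMERIC, LC_MONETARY and LC_TIME, and not the translations of individual programs. The Python sketch below queries one such category. numeric_conventions is a made-up helper name, and the "C" locale used as the default is built into glibc, so that particular call does not touch the archive at all.

```python
import locale


def numeric_conventions(name="C"):
    """Return (decimal point, thousands separator) for the named locale.

    Note: setlocale() changes state for the whole process.
    """
    locale.setlocale(locale.LC_NUMERIC, name)
    conv = locale.localeconv()
    return conv["decimal_point"], conv["thousands_sep"]


print(numeric_conventions("C"))  # ('.', '')
```

Passing a name such as "nl_NL.UTF-8" only works if that locale has been generated on the system; otherwise setlocale() raises locale.Error.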

/usr/share/locale/$LOCALE_DIR/LC_MESSAGES/$PROGRAM.mo

Internationalization (i18n: 18 characters between the 'i' and the 'n') of software on Linux can be achieved with GNU gettext.

When a program is written, every print statement is adapted to wrap the string that needs to be printed in GNU's gettext() function.
Then, xgettext(1) scans the source and extracts these strings into a .pot (Portable Object Template) file.
The human translator can then use msginit(1) to turn it into a .po (Portable Object) file for one language, which represents a message catalog. Then all strings are translated by hand.
After that, msgfmt(1) compiles the edited .po file into a binary .mo (Machine Object) file. These files can be shipped along with the software package.
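
To show what ends up inside a .mo file, here is a rough Python sketch that serializes a tiny catalog into the GNU MO layout and reads it back with the standard gettext module. It stands in for msgfmt(1) only for illustration: compile_mo and load are made-up names, the hash table that msgfmt normally writes is left out, and the message with its Dutch translation is invented, not taken from wget.

```python
import gettext
import io
import struct


def compile_mo(catalog):
    """Serialize {msgid: msgstr} into GNU MO bytes (little-endian, no hash table)."""
    # The empty msgid carries the metadata header, including the charset.
    entries = {"": "Content-Type: text/plain; charset=UTF-8\n", **catalog}
    ids = sorted(entries)  # msgids are stored in sorted order
    n = len(ids)
    ids_table = 7 * 4                # table of (length, offset) for msgids
    strs_table = ids_table + 8 * n   # table of (length, offset) for msgstrs
    data_start = strs_table + 8 * n  # NUL-terminated strings follow the tables

    blob = b""
    index = []
    for text in ids + [entries[key] for key in ids]:
        raw = text.encode("utf-8")
        index.append((len(raw), data_start + len(blob)))
        blob += raw + b"\0"

    # Header: magic, revision, count, msgid table, msgstr table, hash size, hash offset
    out = struct.pack("<7I", 0x950412DE, 0, n, ids_table, strs_table, 0, 0)
    for length, offset in index:
        out += struct.pack("<2I", length, offset)
    return out + blob


def load(mo_bytes):
    """Parse MO bytes the same way a .mo file on disk would be parsed."""
    return gettext.GNUTranslations(io.BytesIO(mo_bytes))


nl = load(compile_mo({"Connecting to %s": "Verbinding maken met %s"}))
_ = nl.gettext
print(_("Connecting to %s") % "example.org")
```

A message that is missing from the catalog is returned unchanged, which is why untranslated programs simply keep printing their original strings.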

When a package is installed, /usr/share/locale/<locale_dir>/LC_MESSAGES/ gets populated with $PROGRAM.mo files. When you invoke wget, for example, your locale environment variables (such as LANG) tell wget which locale to use, so it picks up the right precompiled translations from the matching .mo file.
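
The directory lookup can be tried out with Python's gettext module, which searches the same <locale_dir>/LC_MESSAGES/<domain>.mo layout. This is Python's implementation, which I assume mirrors GNU gettext closely enough for illustration. make_tree, locate and the "demo" domain are made up for this sketch, and the .mo file is an empty placeholder because only the path search is shown.

```python
import gettext
import os
import tempfile


def make_tree(root, lang, domain):
    """Create root/<lang>/LC_MESSAGES/<domain>.mo as an empty placeholder."""
    directory = os.path.join(root, lang, "LC_MESSAGES")
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, domain + ".mo")
    open(path, "wb").close()
    return path


def locate(root, domain, locale_name):
    """Return the catalog path Python's gettext would pick, or None."""
    return gettext.find(domain, localedir=root, languages=[locale_name])


root = tempfile.mkdtemp()
make_tree(root, "nl", "demo")
print(locate(root, "demo", "nl_NL.UTF-8"))
```

A request for nl_NL.UTF-8 is satisfied by the plain nl directory because the lookup tries progressively less specific names. That would explain why wget only ships /usr/share/locale/nl/ and not a directory per full locale name.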

Additional Details and Sources

For locale-archive:

Memory-mapping: CentOS-Mailing-List
Methods I18N Subpackaging: Fedora Documentation on different locale-archive compilations

Also consider manpages for locale(1), localedef(1) and locale-gen(8).

For .mo files:

Process of creating .mo files: Wikipedia on Gettext
GNU MO File Format: explanation and binary format

Also consider manpages for xgettext(1), msginit(1) and msgfmt(1).

Also take a look at the environment variables LC_MESSAGES and LOCPATH.

I am sure this only scratched the surface of this vast topic. Nevertheless I hope this is enough to get you started.

Related Solutions

Debian – Swedish unicode characters in xdm / xlogin

I suspect xdm does not support UTF-8 even though your environment may be set that way. It is still down to the application to handle the interpretation of strings and the encoding they may contain.

To fix this issue, I removed the UTF-8 encoded strings and replaced them with their ISO-8859-15 counterparts (you can get the list of ISO-8859-15 sequences to use with man iso_8859-15). So this seemed to work for me:

xlogin*greeting: V\344lkommen till CLIENTHOST
xlogin*namePrompt: Anv\344ndare:
xlogin*fail: Fel l\366senord!

This also meant I didn't need to set anything in Xsetup either (I was originally trying to use sv_SE.utf8).
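
If you would rather not look up the octal codes by hand, a small Python helper along these lines can produce them. to_xresource is a made-up name, and the sketch assumes the text contains no backslashes of its own.

```python
def to_xresource(text, codec="iso-8859-15"):
    """Replace every non-ASCII character with the octal escape of its encoded byte."""
    out = []
    for byte in text.encode(codec):
        # ASCII bytes pass through; anything else becomes a backslash
        # followed by three octal digits.
        out.append(chr(byte) if byte < 0x80 else "\\%03o" % byte)
    return "".join(out)


print(to_xresource("Välkommen till CLIENTHOST"))
print(to_xresource("Fel lösenord!"))
```

Characters that do not exist in ISO-8859-15 make encode() raise UnicodeEncodeError, which is a useful early warning that the string cannot be represented in that codeset.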

Locale en_AG vs en_AG.utf8 – Key Differences Explained

When you give a locale by the name language_COUNTRY, you actually specify one of the locales defined as language_COUNTRY.codeset: the default one for this language and country. In the case of en_AG, it appears that the default codeset is UTF-8. For en_US, it is ISO-8859-1, and therefore en_US is in fact equivalent to en_US.ISO-8859-1.