Does (should) LC_COLLATE affect character ranges?

locale, regular expression

Collation order through LC_COLLATE defines not only the sort order of individual characters, but also the meaning of character ranges. Or does it? Consider the following snippet:

unset LANGUAGE LC_ALL
echo B | LC_COLLATE=en_US grep '[a-z]'

Intuitively, B isn't in [a-z], so this shouldn't output anything. That's what happens on Ubuntu 8.04 or 10.04. But on some machines running Debian lenny or squeeze, B is found, because the range a-z includes everything that's between a and z in the collation order, including the capital letters B through Z.

All systems tested do have the en_US locale generated. I also tried varying the locale: on the machines where B is matched above, the same happens in every available locale (mostly latin-based: {en_{AU,CA,GB,IE,US},fr_FR,it_IT,es_ES,de_DE}{iso8859-1,iso8859-15,utf-8}, also Chinese locales) except Japanese (in any available encoding) and C/POSIX.
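To reproduce the comparison across several locales at once, something like the following sketch may help. The helper name in_range is made up for illustration, and the list of locales is only an assumption; substitute names from `locale -a`. On glibc, a locale that is not installed typically falls back silently to C behavior, so a "does not match" result for such a name says nothing about that locale.

```shell
#!/bin/sh
# in_range is a hypothetical helper, not part of any standard tool.
# It reports whether "B" matches the range [a-z] under the given locale.
in_range() {
    # env passes LC_ALL to grep only; LC_ALL overrides every other locale variable
    if printf 'B\n' | env LC_ALL="$1" grep -q '[a-z]'; then
        echo "$1: B matches [a-z]"
    else
        echo "$1: B does not match [a-z]"
    fi
}

# The locale names below are examples; results for anything but C/POSIX vary by system.
for loc in C POSIX en_US en_US.UTF-8; do
    in_range "$loc"
done
```

In the C and POSIX locales the range is defined by byte values, so B is never matched there.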

What do character ranges mean in regular expressions, when you go beyond ASCII? Why is there a difference between some Debian installations on the one hand, and other Debian installations and Ubuntu on the other? How do other systems behave? Who's right, and who should have a bug reported against?

(Note that I'm specifically asking about the behavior of character ranges such as [a-z] in en_US locales, primarily on GNU libc-based systems. I'm not asking how to match lowercase letters or ASCII lowercase letters.)

On two Debian machines, one where B is in [a-z] and one where it isn't, the output of LC_COLLATE=en_US locale -k LC_COLLATE is

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="ISO-8859-1"

and the output of LC_COLLATE=en_US.utf8 locale -k LC_COLLATE is

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2039
collate-codeset="UTF-8"

Best Answer

If you are using anything other than the C locale, you shouldn't use ranges like [a-z], since these are locale-dependent and don't always give the results you would expect. Besides the case issue you've already encountered, some locales treat characters with diacritics (e.g. á) the same as the base character (i.e. a).

Instead, use a named character class:

echo B | grep '[[:lower:]]'

This will always give the correct result for the locale. However, you need to choose the locale to reflect the meaning of both your input text and the test you are trying to apply.
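As a small illustration of the difference, the sketch below runs the named classes in the C locale, where the result is predictable. In other locales the set of lowercase and uppercase characters comes from that locale's LC_CTYPE data, so non-ASCII letters may match as well.

```shell
# Named classes consult character classification (LC_CTYPE),
# not the collation order, so case never leaks across the class.
printf 'a\nB\nc\n' | LC_ALL=C grep '[[:lower:]]'   # prints a and c
printf 'a\nB\nc\n' | LC_ALL=C grep '[[:upper:]]'   # prints B
```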

For example, if you need to find a particular byte value, use the C locale, which is always available:

echo B | LANG=C grep '[a-z]'

If this doesn't work as expected, it really is a bug.

Locale names

On all current unix variants that I know of (but not on a few antiques), locale names follow the same pattern:

An ISO 639-1 lowercase two-letter language code, or an ISO 639-2 three-letter language code if the language has no two-letter code. For example, en for English, de for German, ja for Japanese, uk for Ukrainian, ber for Berber, …
For many but not all languages, an underscore _ followed by an ISO 3166 uppercase two-letter country code. Thus: en_US for US English, en_GB for British English, fr_CA for Canadian (Québec) French, de_DE for German of Germany, de_AT for German of Austria, ja_JP for Japanese (of Japan), etc.
Optionally, a dot . followed by the name of a character encoding such as UTF-8, ISO-8859-1, KOI8-U, GB2312, Big5, etc. With GNU libc at least (I don't know how widespread this is), case and punctuation are ignored in encoding names. For example, zh_CN.UTF-8 is Mandarin (simplified) Chinese encoded in UTF-8, while zh_CN is Mandarin Chinese encoded in GB2312, and zh_TW is Taiwanese (traditional) Chinese encoded in Big5.
Optionally, an at sign @ followed by the name of a variant. The meaning of variants is locale-dependent. For example, many European countries have an @euro locale variant where the currency sign is € and where the encoding is one that includes this character (ISO 8859-15 or ISO 8859-16), as opposed to the unadorned variant with the older currency sign. For example, en_IE (English, Ireland) uses the latin1 (ISO 8859-1) encoding and £ as the currency symbol while en_IE@euro uses the latin9 (ISO 8859-15) encoding and € as the currency symbol.

In addition, there are two locale names that exist on all unix-like systems: C and POSIX. These names are synonymous and mean computerese, i.e. default settings that are appropriate for data that is parsed by a computer program.
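The naming pattern can be made concrete with a short sketch that splits a name into its parts. The function split_locale is hypothetical and simply assumes the language[_COUNTRY][.encoding][@variant] layout described above; it does not check that the locale exists.

```shell
# split_locale is a hypothetical helper for illustration only.
split_locale() {
    name=$1
    variant= codeset= territory=
    # Peel the optional parts off from the right: @variant, .encoding, _COUNTRY
    case $name in *@*) variant=${name#*@}; name=${name%%@*} ;; esac
    case $name in *.*) codeset=${name#*.}; name=${name%%.*} ;; esac
    case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
    printf 'language=%s territory=%s codeset=%s variant=%s\n' \
        "$name" "$territory" "$codeset" "$variant"
}

split_locale en_IE.ISO-8859-15@euro
# prints: language=en territory=IE codeset=ISO-8859-15 variant=euro
split_locale C
# prints: language=C territory= codeset= variant=
```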

Locale settings

The following locale categories are defined by POSIX:

LC_CTYPE: the character set used by terminal applications: classification data (which characters are letters, punctuation, spaces, invalid, etc.) and case conversion. Text utilities typically heed LC_CTYPE to determine character boundaries.
LC_COLLATE: collation (i.e. sorting) order. This setting is of very limited use for several reasons:
- Most languages have intricate rules that depend on what is being sorted (e.g. dictionary words and proper names might not use the same order) and cannot be expressed by LC_COLLATE.
- Few of the applications where proper sort order matters are performed by software that uses locale settings. For example, word processors store the language and encoding of a file in the file itself (otherwise the file wouldn't be processed correctly on a system with different locale settings) and don't care about the locale settings specified by the environment.
- LC_COLLATE can have nasty side effects, in particular because it causes the sort order A < a < B < …, which makes “between A and Z” include the lowercase letters a through y. As a result, very common regular expressions like [A-Z] break some applications.
LC_MESSAGES: the language of informational and error messages.
LC_NUMERIC: number formatting: decimal and thousands separator.
Many applications hard-code . as a decimal separator. This makes LC_NUMERIC not very useful and potentially dangerous:
- Even if you set it, you'll still see the default format pretty often.
- You're likely to get into a situation where one application produces locale-dependent output and another application expects . to be the decimal point, or , to be a field separator.
LC_MONETARY: like LC_NUMERIC, but for amounts of local currency.
Very few applications use this.
LC_TIME: date and time formatting: weekday and month names, 12 or 24-hour clock, order of date parts, punctuation, etc.
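The effect of LC_COLLATE on sort order is easy to observe with sort. The following is a minimal sketch: sorted_in is a made-up helper name, and en_US.UTF-8 is only an example that may not be installed on a given system.

```shell
# sorted_in is a hypothetical helper: sort four letters under the given
# locale and print them on one line.
sorted_in() {
    printf 'b\nA\na\nB\n' | env LC_ALL="$1" sort | paste -sd ' ' -
}

sorted_in C             # byte order, prints: A B a b
sorted_in en_US.UTF-8   # order depends on that locale's collation rules
```

In the C locale all uppercase ASCII letters come before all lowercase ones, which is why ranges such as [A-Z] behave predictably there.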

GNU libc, which you'll find on non-embedded Linux, defines additional locale categories:

LC_PAPER: the default paper size (defined by height and width).
LC_NAME, LC_ADDRESS, LC_TELEPHONE, LC_MEASUREMENT, LC_IDENTIFICATION: I don't know of any application that uses these.

Environment variables

Applications that use locale settings determine them from environment variables.

The value of the LANG environment variable is used unless overridden by another setting. If LANG is not set, the default locale is C.
Each LC_xxx category name can also be used as an environment variable, overriding LANG for that category.
If LC_ALL is set, then all other values are ignored; this is primarily useful to run applications with LC_ALL=C when they need to produce the same output regardless of where they are run.
In addition, GNU libc uses LANGUAGE to define fallbacks for LC_MESSAGES (e.g. LANGUAGE=fr_BE:fr_FR:en to prefer Belgian French, or if unavailable France French, or if unavailable English).
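The lookup order can be sketched as a small shell function. effective_locale is a hypothetical name, and the function only mirrors the precedence described here (LC_ALL, then the category's own variable, then LANG, then C); running `locale` with no arguments shows what the system itself resolves.

```shell
# effective_locale is a hypothetical helper for illustration.
# Usage: effective_locale LC_COLLATE
effective_locale() {
    category=$1
    # Read the variable whose name is stored in $category (empty if unset)
    eval "specific=\${$category-}"
    if [ -n "${LC_ALL-}" ]; then
        echo "$LC_ALL"          # LC_ALL overrides everything
    elif [ -n "$specific" ]; then
        echo "$specific"        # then the category's own variable
    elif [ -n "${LANG-}" ]; then
        echo "$LANG"            # then LANG
    else
        echo C                  # default when nothing is set
    fi
}

# Example in a subshell so the caller's environment is left alone:
( unset LC_ALL; LANG=POSIX; LC_COLLATE=C; effective_locale LC_COLLATE )   # prints C
```

An empty value is treated like an unset variable here, which matches how POSIX describes these variables.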

Installing locales

Locale data can be large, so some distributions don't ship them in a usable form and instead require an additional installation step.

On Debian, to install locales, run dpkg-reconfigure locales and select from the list in the dialog box, or edit /etc/locale.gen and then run locale-gen.
On Ubuntu, to install locales, run locale-gen with the names of the locales as arguments.
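Before generating anything, it can be worth checking whether a locale is already available. The sketch below assumes the `locale` utility is present; has_locale is a made-up helper, and the spelling en_US.utf8 is how glibc usually lists that locale in `locale -a`, which may differ on other systems.

```shell
# has_locale is a hypothetical helper: succeeds if the exact name
# appears in the output of `locale -a`.
has_locale() {
    locale -a 2>/dev/null | grep -qxF -- "$1"
}

if has_locale en_US.utf8; then
    echo "en_US.utf8 is available"
else
    echo "en_US.utf8 is missing; generate it as described above"
fi
```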

You can define your own locale.

Recommendation

The useful settings are:

Set LC_CTYPE to the language and encoding that you encode your text files in. Ensure that your terminals use that encoding.
For most languages, only the encoding matters. There are a few exceptions; for example, an uppercase i is I in most languages but İ in Turkish (tr_TR).
Set LC_MESSAGES to the language that you want to see messages in.
Set LC_PAPER to en_US if you want US Letter to be the default paper size and just about anything else (e.g. en_GB) if you want A4.
Optionally, set LC_TIME to your favorite time format.

As explained above, avoid setting LC_COLLATE and LC_NUMERIC. If you use LANG, explicitly override these two categories by setting them to C.
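Put together, the recommendation might look like the following fragment for a file such as ~/.profile. The locale names are placeholders and assume the corresponding locales are installed; adjust them to your language and encoding.

```shell
# Example ~/.profile fragment; locale names are placeholders.
export LANG=en_US.UTF-8       # default for LC_CTYPE, LC_MESSAGES, LC_TIME, ...
export LC_COLLATE=C           # byte-order sorting, predictable [a-z] ranges
export LC_NUMERIC=C           # keep . as the decimal separator
export LC_PAPER=en_GB.UTF-8   # A4 as the default paper size
unset LC_ALL                  # LC_ALL would override all of the above
```

Leaving LC_ALL unset matters: if it is set anywhere in the environment, the per-category overrides above have no effect.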
