Does (should) LC_COLLATE affect character ranges?

locale, regular expression

Collation order through LC_COLLATE defines not only the sort order of individual characters, but also the meaning of character ranges. Or does it? Consider the following snippet:

unset LANGUAGE LC_ALL
echo B | LC_COLLATE=en_US grep '[a-z]'

Intuitively, B isn't in [a-z], so this shouldn't output anything. That's what happens on Ubuntu 8.04 or 10.04. But on some machines running Debian lenny or squeeze, B is found, because the range a-z includes everything that's between a and z in the collation order, including the capital letters B through Z.
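One way to see the order that a range like a-z walks through is to sort a few mixed-case letters under each locale. This is only a sketch: order_in is a hypothetical helper name, en_US.UTF-8 is assumed to be installed, and what it prints for that locale depends on the libc version, so only the C result is stated.

```shell
# order_in (hypothetical helper): sort six mixed-case letters under the
# locale given as $1 and print them on a single line.
order_in() {
    printf '%s\n' a B z A b Z | LC_ALL="$1" sort | paste -sd ' ' -
}

order_in C            # byte order: A B Z a b z
order_in en_US.UTF-8  # collation order of that locale, which varies by system
```

If the second line shows B sorted between a and z, a tool that interprets ranges by collation order will treat B as part of [a-z].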

All systems tested do have the en_US locale generated. I also tried varying the locale: on the machines where B is matched above, the same happens in every available locale (mostly latin-based: {en_{AU,CA,GB,IE,US},fr_FR,it_IT,es_ES,de_DE}{iso8859-1,iso8859-15,utf-8}, also Chinese locales) except Japanese (in any available encoding) and C/POSIX.
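A survey like this could be scripted roughly as follows. This is a sketch under assumptions: check_range is a hypothetical helper, the locale names are examples to be replaced with whatever locale -a lists, and LC_ALL is used instead of LC_COLLATE so that the result does not depend on the rest of the environment.

```shell
# check_range (hypothetical helper): report whether "B" matches [a-z]
# under the locale given as $1. Outside the C locale the answer varies
# between systems, which is the point of the experiment.
check_range() {
    if echo B | LC_ALL="$1" grep -q '[a-z]'; then
        echo "$1: B matches [a-z]"
    else
        echo "$1: B does not match [a-z]"
    fi
}

# Example locale names only; substitute the ones from `locale -a`.
for loc in C POSIX en_US.UTF-8 fr_FR.UTF-8 ja_JP.UTF-8; do
    check_range "$loc"
done
```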

What do character ranges mean in regular expressions, when you go beyond ASCII? Why is there a difference between some Debian installations on the one hand, and other Debian installations and Ubuntu on the other? How do other systems behave? Who's right, and which system should have a bug reported against it?

(Note that I'm specifically asking about the behavior of character ranges such as [a-z] in en_US locales, primarily on GNU libc-based systems. I'm not asking how to match lowercase letters or ASCII lowercase letters.)


On two Debian machines, one where B is in [a-z] and one where it isn't, the output of LC_COLLATE=en_US locale -k LC_COLLATE is the same:

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="ISO-8859-1"

and the output of LC_COLLATE=en_US.utf8 locale -k LC_COLLATE is

collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=2039
collate-codeset="UTF-8"

Best Answer

If you are using anything other than the C locale, you shouldn't be using ranges like [a-z], since these are locale-dependent and don't always give the results you would expect. Besides the case issue you've already encountered, some locales treat characters with diacritics (e.g. á) the same as the base character (i.e. a).

Instead, use a named character class:

echo B | grep '[[:lower:]]'

This will always give the correct result for the locale. However, you need to choose the locale to reflect the meaning of both your input text and the test you are trying to apply.
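As a small illustration of named classes, the sketch below sorts single characters into lower, upper, or other. classify is a hypothetical helper name. The examples stick to ASCII, because how non-ASCII letters are classified depends on the locale and encoding in effect.

```shell
# classify (hypothetical helper): print which POSIX character class the
# single character in $1 belongs to, under the current locale.
classify() {
    if printf '%s\n' "$1" | grep -q '^[[:lower:]]$'; then
        echo lower
    elif printf '%s\n' "$1" | grep -q '^[[:upper:]]$'; then
        echo upper
    else
        echo other
    fi
}

classify B   # upper
classify b   # lower
classify 7   # other
```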

For example, if you need to find a particular byte value, use the C locale, which is always available:

echo B | LC_ALL=C grep '[a-z]'

If this doesn't work as expected, it really is a bug.
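Keep in mind that LANG is the lowest-priority locale variable: an LC_COLLATE or LC_ALL already set in the environment takes precedence over it, whereas LC_ALL overrides everything. A minimal sketch of a wrapper that forces byte-wise matching regardless of the caller's environment (c_grep is a hypothetical name):

```shell
# c_grep (hypothetical helper): run grep in the C locale no matter what
# LANG or LC_* variables the caller has set, since LC_ALL wins over all.
c_grep() {
    LC_ALL=C grep "$@"
}

echo B | c_grep -c '[a-z]'   # prints 0: B is not in the byte range a-z
echo b | c_grep -c '[a-z]'   # prints 1
```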