Bash – Why Prefer Character Classes Over Character Ranges

bash, command line, shell, wildcards

The book The Linux Command Line (page 47 by page count) says:

… you have to be very careful with them [character ranges] because they will not produce the expected results unless properly configured. For now, you should avoid using them and use character classes instead.

The book gives no reason, other than that.

Question – 1: So, why exactly should character classes (e.g. [:alnum:], [:alpha:], [:digit:], etc.) be preferred over character ranges (e.g. [a-z], [A-Z], [0-9], etc.)?

Question – 2: Does [:alpha:] stand for [a-z], [A-Z], and upper and lower-case alphabets from other languages too? And similarly, does [:digit:] include numerals from other languages too? If they match, that is.

(Two questions, I know, but in this case, they are pretty much interrelated, IMO.)

Best Answer

According to the bash manpage, the LC_COLLATE environment variable affects character ranges, exactly as per Hauke Laging's answer:

LC_COLLATE This variable determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.

On the other hand, LC_CTYPE affects character classes:

LC_CTYPE This variable determines the interpretation of characters and the behavior of character classes within pathname expansion and pattern matching.
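To make the split concrete, here is a minimal sketch that pins the locale to C, where both ranges and classes behave predictably. The matches helper is a hypothetical name introduced only for this illustration; it relies on case, which applies the same pattern-matching rules as pathname expansion.

```bash
#!/usr/bin/env bash
# matches PATTERN STRING: print "yes" if STRING matches the glob PATTERN, else "no".
# (hypothetical helper for this illustration, not a bash builtin)
matches() {
    case $2 in
        $1) echo yes ;;   # $1 is unquoted on purpose so it is treated as a pattern
        *)  echo no ;;
    esac
}

# Assumption: pin every locale category to C so the results are reproducible.
LC_ALL=C
export LC_ALL

matches '[a-z]' b          # range, governed by LC_COLLATE -> yes
matches '[a-z]' B          # -> no (in C, ranges follow byte order)
matches '[[:lower:]]' b    # class, governed by LC_CTYPE -> yes
matches '[[:digit:]]' 7    # -> yes
```

Under a different locale the two range lines may give different answers, which is the point of the manpage text above. Bash 4.3 and later also provide a globasciiranges shell option that makes ranges behave as in the C locale; whether it is on by default depends on the bash version and build.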

What this means is that both cases are potentially problematic if you're thinking in an English, left-to-right, Latin-alphabet, Arabic-numeral context.

If you want to be rigorous, and/or are scripting for a multi-locale environment, it's probably best to check what your locale variables are set to when you're matching files, or to write your patterns in a completely locale-independent way.
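One way to follow that advice, sketched under assumptions: inspect the current settings with the locale command (if it is installed), and pin the locale only around the match so the rest of the script is unaffected. The function name count_matches is hypothetical and exists only for this example.

```bash
#!/usr/bin/env bash
# Show the effective locale categories, if the locale utility is available.
if command -v locale >/dev/null 2>&1; then
    locale
fi

# count_matches DIR PATTERN: count entries in DIR that match glob PATTERN,
# with the locale pinned to C for the duration of the match only.
# (hypothetical helper; the parentheses make the body run in a subshell,
# so the LC_ALL change and the cd do not leak into the caller)
count_matches() (
    LC_ALL=C
    export LC_ALL
    cd -- "$1" || exit 1
    n=0
    for entry in $2; do    # unquoted on purpose so the glob expands
        # An unmatched glob stays as literal text, so confirm the entry exists.
        if [ -e "$entry" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
)
```

For example, count_matches /tmp/test '[a-z]*' would report how many names in /tmp/test begin with an unaccented lowercase ASCII letter, regardless of the caller's locale.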

It's very difficult to foresee some situations though, unless you've studied linguistics.

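However, I don't know of a Latin-using locale that changes the order of the letters themselves, so [a-z] should still cover the unaccented lowercase letters. Note, though, that some locales collate in dictionary order (aAbBcC…), where [a-z] can also match uppercase letters, and some extensions to the Latin alphabet collate ligatures and diacritics differently. Here's a little experiment: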

mkdir -p /tmp/test
cd /tmp/test
export LC_CTYPE=de_DE.UTF-8
export LC_COLLATE=de_DE.UTF-8
touch Grüßen
ls G*          # lists ‘Grüßen’
ls *[a-z]en    # no match!
ls *[a-zß]en   # lists ‘Grüßen’
ls Gr[a-z]*en  # no match!

This is interesting: at least for German, neither diacritics like ü nor ligatures like ß are folded into plain Latin letters. (Either that, or I messed up the locale change!)

This can bite you, of course, if you're trying to find filenames that start with a letter using [a-z]*: a file whose name starts with ‘Ä’ will not match.
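For the "starts with a letter" case, a class-based test is one option. The sketch below uses a hypothetical helper, starts_with_letter, and only ASCII sample names so that the result does not depend on the locale. Whether a name beginning with ‘Ä’ passes depends on LC_CTYPE: it should in a UTF-8 locale that classifies ‘Ä’ as alphabetic, but that is an assumption to verify on your own system.

```bash
#!/usr/bin/env bash
# starts_with_letter NAME: succeed if NAME begins with a character that the
# current LC_CTYPE classifies as alphabetic.
# (hypothetical helper for this illustration)
starts_with_letter() {
    case $1 in
        [[:alpha:]]*) return 0 ;;
        *)            return 1 ;;
    esac
}

for name in Zebra apple 9lives _hidden; do
    if starts_with_letter "$name"; then
        echo "$name: starts with a letter"
    else
        echo "$name: does not start with a letter"
    fi
done
```

The same pattern works directly as a glob, for example ls [[:alpha:]]* in place of ls [a-z]*.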
