Bash – Why Prefer Character Classes Over Character Ranges

bash, command line, shell, wildcards

The Linux Command Line (book, page 47) says:

… you have to be very careful with them [character ranges] because they will not produce the expected results unless properly configured. For now, you should avoid using them and use character classes instead.

The book gives no reason, other than that.

Question – 1: So, why exactly should Character Classes (e.g. [:alnum:], [:alpha:], [:digit:], etc) be preferred over Character Ranges (e.g. [a-z], [A-Z], [0-9], etc)?

Question – 2: Does [:alpha:] stand for [a-z], [A-Z], and upper and lower-case alphabets from other languages too? And similarly, does [:digit:] include numerals from other languages too? If they match, that is.

(Two questions, I know, but in this case, they are pretty much interrelated, IMO.)

Best Answer

According to the bash manpage, the LC_COLLATE environment variable affects character ranges, exactly as per Hauke Laging's answer:

LC_COLLATE This variable determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.

On the other hand, LC_CTYPE affects character classes:

LC_CTYPE This variable determines the interpretation of characters and the behavior of character classes within pathname expansion and pattern matching.
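To make the difference concrete, here is a minimal sketch, assuming a POSIX-style shell. The helper name matches is invented for this illustration and does not come from the manpage. It pins the locale to C, where both behaviours are predictable, and tests a few single characters against a class and a range.

```shell
# matches PATTERN STRING: succeed if STRING matches the shell pattern PATTERN.
# (Hypothetical helper for this illustration.)
matches() {
    case $2 in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# Pin every locale category so the result does not depend on the caller.
export LC_ALL=C

for ch in a Z 5 _; do
    if matches '[[:alpha:]]' "$ch"; then class=yes; else class=no; fi
    if matches '[a-z]' "$ch"; then range=yes; else range=no; fi
    printf '%s: [[:alpha:]]=%s [a-z]=%s\n' "$ch" "$class" "$range"
done
```

In the C locale, Z matches the class but not the range. Under any other locale the range result follows LC_COLLATE and the class result follows LC_CTYPE, so it is safer to rerun this with the locale you actually care about than to assume.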

What this means is that both cases are potentially problematic if you're thinking in an English, left-to-right, Latin-alphabet, Arabic-digit context.

If you're really proper, and/or are scripting for a multi-locale environment, it's probably best to make sure you know what your locale variables are when you're matching files, or to be sure that you're coding in a completely generic way.
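One way to "know what your locale variables are" is to fix them yourself for the duration of the match. The following is only a sketch under that assumption; list_ascii_lower is a made-up name, and forcing LC_ALL=C is one possible choice, not the only one.

```shell
# list_ascii_lower DIR: print the names in DIR that start with an ASCII
# lowercase letter. (Hypothetical helper name.)
list_ascii_lower() {
    (
        # The subshell keeps the locale change from leaking to the caller.
        export LC_ALL=C
        cd -- "$1" || exit 1
        for f in [a-z]*; do
            # An unmatched glob is left as the literal pattern, so skip it.
            if [ -e "$f" ]; then printf '%s\n' "$f"; fi
        done
    )
}

list_ascii_lower .
```

Running the standard locale utility with no arguments beforehand shows which values the shell would otherwise have used.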

It's very difficult to foresee some situations though, unless you've studied linguistics.

However, I don't know of a Latin-using locale that changes the order of letters, so [a-z] would work. There are extensions to the Latin alphabet that collate ligatures and diacriticals differently. However, here's a little experiment:

mkdir /tmp/test
cd /tmp/test
export LC_CTYPE=de_DE.UTF-8
export LC_COLLATE=de_DE.UTF-8
touch Grüßen
ls G* # This says ‘Grüßen’
ls *[a-z]en # This says nothing!
ls *[a-zß]en # This says ‘Grüßen’
ls Gr[a-z]*en # This says nothing!

This is interesting: at least for German, neither diacriticals like ü nor ligatures like ß are folded into Latin characters (either that, or I messed up the locale change!).
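To rule out the "messed up the locale change" possibility, it may help to check that the locale is installed before exporting it. This is a rough sketch, assuming the POSIX locale -a listing is available; have_locale and normalize_locale_name are hypothetical helper names, and the normalization only handles the common UTF-8 versus utf8 spelling difference.

```shell
# normalize_locale_name: lowercase stdin and drop hyphens, so that
# "de_DE.UTF-8" and "de_DE.utf8" compare equal. (Hypothetical helper.)
normalize_locale_name() {
    tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# have_locale NAME: succeed if `locale -a` lists NAME. (Hypothetical helper.)
have_locale() {
    want=$(printf '%s\n' "$1" | normalize_locale_name)
    locale -a 2>/dev/null | normalize_locale_name | grep -Fxq -- "$want"
}

if have_locale de_DE.UTF-8; then
    echo "de_DE.UTF-8 is installed"
else
    echo "de_DE.UTF-8 is missing; the globs may be evaluated under another locale"
fi
```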

This may be bad for you, of course, if you're trying to find filenames that start with a letter, use [a-z]* and apply it to a file that starts with ‘Ä’.

Related Solutions

Shell – Character classes: construct your own

I'm afraid that the list of character classes is hard-coded in the C library (e.g. in GNU libc, in the build_charclass function in posix/regcomp.c). The only way to extend it would be to recompile the C library.

You can customize the contents of each existing class in a locale definition.

In most cases, it should be good enough to build your regexp as a string:

myclass='a*[:alnum:][:space:]'
regexp="[$myclass]"

You can't subtract characters from a category this way. And take care, when adding ], - or \, to respect the syntax of character classes in your language's regexes.
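As an illustration of that last caveat, here is a sketch for POSIX extended regular expressions as used by grep -E, where ] is literal only when it comes first and - is literal when it comes last. The variable contents and the helper only_allowed are invented for this example; other regex dialects have different escaping rules, including for the backslash.

```shell
# ']' goes first and '-' goes last so that both are taken literally.
myclass=']_[:alnum:]-'
regexp="^[$myclass]+\$"

# only_allowed STRING: succeed if every character of STRING is in the class.
# (Hypothetical helper; LC_ALL=C keeps [:alnum:] to ASCII here.)
only_allowed() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq -- "$regexp"
}

for s in 'a-b]c_1' 'a b'; do
    if only_allowed "$s"; then echo "ok:  $s"; else echo "bad: $s"; fi
done
```

The second string is rejected because the space is not part of the class.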

Bash Bracket Expression Matching Unexpected Character

That's a consequence of those characters having the same sorting order.

You'll also notice that

sort -u << EOF
■
⅕
⅖
⅗
EOF

returns only one line.

Or that:

expr ■ = ⅕

returns true (as required by POSIX).

Most locales shipped with GNU systems have a number of characters, and even sequences of characters (collating sequences), that have the same sorting order. In the case of those ■⅕⅖⅗ ones, it's because the order is not defined, and characters whose order is not defined end up having the same sorting order on GNU systems. There are also characters that are explicitly defined as having the same sorting order, like Ș and Ş, though there's no real logic or consistency, apparent to me anyway, in how it is done.

That is the source of quite surprising and bogus behaviours. I have raised the issue very recently on the Austin group (the body behind POSIX and the Single UNIX Specification) mailing list and the discussion is still ongoing as of 2015-04-03.

In this case, whether [y] should match x where x and y sort the same is unclear to me, but since a bracket expression is meant to match a collating element, that suggests that the bash behaviour is expected.

In any case, I suppose [⅕-⅕] or at least [⅕-⅖] should match ■.

You'll notice that different tools behave differently. ksh93 behaves like bash; GNU grep and sed don't. Some other shells behave differently again, and some, like yash, are even more buggy.

To get consistent behaviour, you need a locale where all characters sort differently. The C locale is the typical one. However, the character set in the C locale on most systems is ASCII. On GNU systems, you generally have access to a C.UTF-8 locale that can be used instead to work on UTF-8 characters.

So:

(export LC_ALL=C.UTF-8; [[ ■ = [⅕⅖⅗] ]])

or the standard equivalent:

(export LC_ALL=C.UTF-8
 case ■ in ([⅕⅖⅗]) true;; (*) false; esac)

should return false.

Another alternative would be to set only LC_COLLATE to C, which would work on GNU systems, but not necessarily on others, where it could fail to specify the sorting order of multi-byte characters.
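A sketch of the LC_COLLATE-only variant follows. One detail that is easy to miss: LC_ALL, when set, takes precedence over LC_COLLATE, so it has to be cleared first. The helper name in_set is hypothetical, and the caveat about non-GNU systems applies unchanged.

```shell
# in_set CHAR SET: succeed if CHAR matches the bracket expression [SET],
# with collation forced to C. (Hypothetical helper.)
in_set() {
    (
        # LC_ALL, when set, overrides every LC_* variable, so clear it first.
        unset LC_ALL
        LC_COLLATE=C
        export LC_COLLATE
        case $1 in
            [$2]) exit 0 ;;
            *)    exit 1 ;;
        esac
    )
}

if in_set b 'a-c'; then echo "b is in a-c"; fi
if in_set B 'a-c'; then echo "B is in a-c"; else echo "B is not in a-c"; fi
```

The same function can be called as in_set ■ '⅕⅖⅗' to repeat the check on a system that has those characters available.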

One lesson from that is that equality is not as clear a notion as one would expect when it comes to comparing strings. Equality might mean, from strictest to least strict:

1. Same number of bytes and all byte constituents have the same value.
2. Same number of characters and all characters are the same (for instance, refer to the same codepoint in the current charset).
3. The two strings have the same sorting order as per the locale's collation algorithm (that is, neither a < b nor b < a is true).

Now, for 2 or 3, that assumes both strings contain valid characters. In UTF-8 and some other encodings, some sequences of bytes don't form valid characters.

1 and 2 are not necessarily equivalent because of that, or because some characters may have more than one possible encoding. That's typically the case with stateful encodings like ISO-2022-JP, where A can be expressed as 41 or as 1b 28 42 41 (1b 28 42 is the sequence that switches to ASCII, and you can insert as many of those as you want without making a difference). I wouldn't expect those types of encoding to still be in use, though, and GNU tools at least generally don't work properly with them.

Also beware that most non-GNU utilities can't deal with the 0 byte value (the NUL character in ASCII).

Which of those definitions is used depends on the utility and utility implementation or version. POSIX is not 100% clear on that. In the C locale, all 3 are equivalent. Outside of that YMMV.
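The strictest and the least strict of those notions can be told apart from the shell. The sketch below is only an approximation: same_bytes and same_collation are hypothetical helper names, the collation check leans on sort -u keeping one line per group of lines that compare equal, and it assumes neither string contains a newline.

```shell
# same_bytes A B: strictest notion, identical byte sequences.
same_bytes() {
    [ "$(printf '%s' "$1" | od -An -tx1)" = "$(printf '%s' "$2" | od -An -tx1)" ]
}

# same_collation A B: least strict notion, `sort -u` in the current locale
# collapses the two lines into one.
same_collation() {
    lines=$(printf '%s\n%s\n' "$1" "$2" | sort -u | wc -l | tr -d '[:space:]')
    [ "$lines" = 1 ]
}

same_bytes abc abc && echo "abc/abc: same bytes"
same_collation abc abd || echo "abc/abd: sort apart"
```

A pair for which same_collation succeeds while same_bytes fails is exactly the kind of input on which utilities may disagree.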
