Bash Globbing not as expected

bash, patterns, wildcards

This is a homework question:

Match all filenames with 2 or more characters that start with a lower case letter, but do not end with an upper case letter.

I do not understand why my solution is not working.

I ran the following:

touch aa
touch ha
touch ah
touch hh
touch a123e
touch hX
touch Ax

ls [a-z]*[!A-Z]

Output:

aa  ha

My question: Why did it not match "ah", "hh", or "a123e"?

Best Answer

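This is a locale problem. In your locale, the collation order interleaves lower and upper case (aAbBcC...zZ), so [A-Z] expands to something like [AbBcC...zZ] (plus, probably, others such as accented characters). Among the plain letters only "a" falls outside that range, so [^A-Z] (which is the same as the [!A-Z] you used) actually means "files that end with a" in your example (and only in your example).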

If you want to avoid such surprises, one way is to set LC_COLLATE=C, since collation is the part of your locale settings that is responsible for the sort order. Also unset or empty LC_ALL if it is set, as it would take precedence.

$ ls [a-z]*[^A-Z]
aa  ha

$ ( LC_ALL=; LC_COLLATE=C; ls [a-z]*[^A-Z] )
a123e  aa  ah  ha  hh
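
To keep the locale override from leaking into the rest of a script, it can be scoped to a function. What follows is only a minimal sketch, assuming bash; the function name list_lower_not_upper is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list names in the current directory that start with a
# lower case ASCII letter and do not end with an upper case ASCII letter.
list_lower_not_upper() {
    # LC_ALL takes precedence over LC_COLLATE, so overriding it locally is
    # enough; bash should restore the previous locale when the function returns.
    local LC_ALL=C
    local f
    for f in [a-z]*[^A-Z]; do
        # An unmatched glob is left as literal text, so skip missing names.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Called in the directory from the question, this should print a123e, aa, ah, ha and hh, one per line.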

Or, better, leave your locale settings alone and use the appropriate character classes: [[:lower:]] instead of [a-z] and [[:upper:]] instead of [A-Z].

$ ls [[:lower:]]*[^[:upper:]]
a123e  aa  ah  ha  hh
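
The same pattern can be checked against plain strings, without touching the file system, using bash's [[ ... ]] pattern matching. This is a small sketch, and name_matches is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the name has 2 or more characters, starts
# with a lower case letter and does not end with an upper case letter.
name_matches() {
    [[ $1 == [[:lower:]]*[^[:upper:]] ]]
}

# Try the names from the question, plus a one-character name.
for name in aa ha ah hh a123e hX Ax a; do
    if name_matches "$name"; then
        printf '%s: match\n' "$name"
    else
        printf '%s: no match\n' "$name"
    fi
done
```

The single-character name "a" is rejected because the pattern needs one character for each bracket expression, which covers the "2 or more characters" part of the exercise.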

Or use bash's globasciiranges option:

$ shopt -s globasciiranges
$ ls [a-z]*[^A-Z]
a123e  aa  ah  ha  hh

$ shopt -u globasciiranges
$ ls [a-z]*[^A-Z]
aa  ha
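
If a script should not change the option for its caller, the current setting can be saved and restored around the match. This is a sketch under the assumption that bash is version 4.3 or later (older versions do not know globasciiranges); the function name matches_ascii_range is invented for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: test a name against [a-z]*[^A-Z] with ASCII ranges,
# then put the globasciiranges option back the way it was.
matches_ascii_range() {
    local name=$1 saved status=1
    # shopt -p prints a command that recreates the current setting, e.g.
    # "shopt -u globasciiranges". It exits non-zero when the option is off.
    saved=$(shopt -p globasciiranges) || true
    shopt -s globasciiranges
    [[ $name == [a-z]*[^A-Z] ]] && status=0
    eval "$saved"
    return "$status"
}

if matches_ascii_range ah; then
    echo "ah matches"
fi
```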

Related Solutions

Bash – Why is bash extended-globbing variable substitution acting at the byte level

The short answer to your question is that *(pattern-list) will match zero or more occurrences of the given patterns. There are zero instances of Unicode character 0001 between each of the input bytes. So the replace operation replaces each of those zero instances by a single space.

Maybe you meant to do this:

$ for str in $'\t' "ab" ळ; do
    printf -- '%s' "${str//+($'\x01')/ }" | xxd
  done
0000000: 09                                       .
0000000: 6162                                     ab
0000000: e0a4 b3                                  ...
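
The +(pattern-list) form needs at least one occurrence, so it only fires where the byte is really present. Here is a small sketch of that behaviour, assuming extglob is available; squeeze_ctrl_a is a hypothetical name.

```bash
#!/usr/bin/env bash
shopt -s extglob   # enable +(...) and *(...) before they are used

# Hypothetical helper: replace each run of one or more \x01 bytes with a
# single space. With *(...) in place of +(...), the empty string would
# also count as a match.
squeeze_ctrl_a() {
    printf '%s' "${1//+($'\x01')/ }"
}

squeeze_ctrl_a $'a\x01\x01\x01b'; echo   # run of three \x01 becomes one space
squeeze_ctrl_a "ab"; echo                # nothing to replace, stays "ab"
```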

But the longer answer is that in any case, pathnames aren't text. At least, they're not as far as the (Unix-like) operating system is concerned. They are byte sequences. The problem is that things like this are trivial to do:

$ LC_ALL=latin1
$ mkdir 'áñ' && cd 'áñ'
$ LC_ALL=ga_IE.iso885915@euro
$ mkdir '€25' && cd '€25'
$ LC_ALL=zh_TW
$ pwd
# ... what should the output be?  And what about the output of:
$ /bin/pwd

Each of those locales includes characters which don't exist in the others. This problem affects things like locate -r and find -regex too. The argument of locate -r is a regular expression, which therefore must support things like character classes. But you don't know which locale to use to determine the character classes for the characters in the path names, or even whether there is a single usable locale that can represent all the paths on the system.
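
The gap between "characters" and "bytes" is easy to see from the shell itself. The sketch below counts the bytes of a name by forcing the C locale inside a function; byte_len is a hypothetical helper, and the character count on the first output line depends on whatever locale the shell is running in.

```bash
#!/usr/bin/env bash
# Hypothetical helper: length of a string in bytes, whatever the caller's
# locale is.
byte_len() {
    local LC_ALL=C   # in the C locale every byte counts as one character
    printf '%s\n' "${#1}"
}

name=$'\xe0\xa4\xb3'   # the UTF-8 bytes of ळ, as shown by xxd earlier
printf 'characters in the current locale: %s\n' "${#name}"
printf 'bytes: %s\n' "$(byte_len "$name")"
```

In a UTF-8 locale the first line should report 1 and the second 3; in the C locale both report 3, because the shell then has no way to know that the three bytes belong together.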

Bash – Why Prefer Character Classes Over Character Ranges

According to the bash manpage, the LC_COLLATE environment variable affects character ranges, exactly as per Hauke Laging's answer:

LC_COLLATE This variable determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.

On the other hand, LC_CTYPE affects character classes:

LC_CTYPE This variable determines the interpretation of characters and the behavior of character classes within pathname expansion and pattern matching.
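
Since ranges and classes are governed by different variables, it can help to check which value is actually in effect for each. The following is a rough sketch of the usual precedence (LC_ALL, then the category variable, then LANG); effective_locale is a made-up name, and the final fallback to C is an assumption, since the default is left to the implementation. The standard locale command reports the same information.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value that applies to a locale category
# such as LC_COLLATE or LC_CTYPE.
effective_locale() {
    local category=$1
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"          # LC_ALL overrides everything
    elif [ -n "${!category:-}" ]; then
        printf '%s\n' "${!category}"     # then the category's own variable
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"            # then the general default
    else
        printf '%s\n' "C"                # assumption: fall back to C
    fi
}

printf 'ranges use:  %s\n' "$(effective_locale LC_COLLATE)"
printf 'classes use: %s\n' "$(effective_locale LC_CTYPE)"
```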

What this means is that both cases are potentially problematic if you're thinking in an English, left-to-right, Latin-alphabet, Arabic-digit context.

If you want to be rigorous, and/or are scripting for a multi-locale environment, it's probably best to make sure you know what your locale variables are when you're matching files, or to be sure that you're coding in a completely generic way.

It's very difficult to foresee some situations though, unless you've studied linguistics.

That said, I don't know of a Latin-using locale that changes the order of letters, so [a-z] would work. There are extensions to the Latin alphabet that collate ligatures and diacriticals differently, though. Here's a little experiment:

mkdir /tmp/test
cd /tmp/test
export LC_CTYPE=de_DE.UTF-8
export LC_COLLATE=de_DE.UTF-8
touch Grüßen
ls G* # This says ‘Grüßen’
ls *[a-z]en # This says nothing!
ls *[a-zß]en # This says ‘Grüßen’
ls Gr[a-z]*en # This says nothing!

This is interesting: at least for German, neither diacriticals like ü nor ligatures like ß are folded into Latin characters. (Either that, or I messed up the locale change!)

This may bite you, of course, if you're trying to find filenames that start with a letter, use [a-z]* to do so, and have a file whose name starts with ‘Ä’.
