The short answer to your question is that *(pattern-list) matches zero or more occurrences of the given patterns. There are zero occurrences of the character U+0001 between each pair of adjacent characters in the input, so the replace operation replaces each of those empty matches with a single space.
Maybe you meant to do this:
$ for str in $'\t' "ab" ळ ; do
    printf -- '%s' "${str//+($'\x01')/ }" | xxd
  done
0000000: 09 .
0000000: 6162 ab
0000000: e0a4 b3 ...
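To see the difference between the two operators side by side, here is a minimal sketch. It assumes bash with extglob available, and it uses a visible `x` in place of U+0001 so the result is easy to read; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
shopt -s extglob            # enable the *(...) and +(...) pattern operators

s='axxb'                    # a run of two x characters between a and b
t='ab'                      # no x at all

plus_s=${s//+(x)/ }         # each run of one or more x becomes one space: "a b"
plus_t=${t//+(x)/ }         # +(x) needs at least one x, so t is unchanged: "ab"
star_t=${t//*(x)/ }         # *(x) also matches the empty string

printf '[%s]\n' "$plus_s" "$plus_t" "$star_t"
```

The third line of output is the interesting one: since *(x) is allowed to match nothing, the substitution also fires where there is no x.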
But the longer answer is that in any case, pathnames aren't text. At least, they're not as far as the (Unix-like) operating system is concerned. They are byte sequences. The problem is that things like this are trivial to do:
$ LC_ALL=latin1
$ mkdir 'áñ' && cd 'áñ'
$ LC_ALL=ga_IE.iso885915@euro
$ mkdir '€25' && cd '€25'
$ LC_ALL=zh_TW
$ pwd
# ... what should the output be? And what about the output of:
$ /bin/pwd
Each of those locales includes characters which don't exist in the others. This problem affects things like locate -r and find -regex too. The argument of locate -r is a regular expression, so it must support things like character classes; but you don't know which locale to use to determine the character classes for the characters in the path names, or even whether there is a single usable locale that can represent all the paths on the system.
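To make the "byte sequences" point concrete, here is a small sketch that dumps the bytes of the same visible name, "áñ", as it would be stored under two different encodings. The helper name hex is made up for this example, and it assumes bash (for the $'...' quoting) plus the standard od and tr utilities.

```shell
#!/usr/bin/env bash
# The same visible name, "áñ", as two different byte sequences.
name_latin1=$'\xe1\xf1'            # ISO-8859-1 encoding: 2 bytes
name_utf8=$'\xc3\xa1\xc3\xb1'      # UTF-8 encoding: 4 bytes

# Hypothetical helper: print the bytes of its argument as lowercase hex.
hex() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

printf 'latin1: %s\n' "$(hex "$name_latin1")"
printf 'utf8:   %s\n' "$(hex "$name_utf8")"
```

The kernel only ever sees those bytes. Which characters they stand for depends entirely on the locale of whoever reads the name back.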
According to the bash manpage, the LC_COLLATE environment variable affects character ranges, exactly as per Hauke Laging's answer:
LC_COLLATE
This variable determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.
On the other hand, LC_CTYPE affects character classes:
LC_CTYPE
This variable determines the interpretation of characters and the behavior of character classes within pathname expansion and pattern matching.
What this means is that both cases are potentially problematic if you're thinking in an English, left-to-right, Latin alphabet, Arabic-digit context.
If you want to be thorough, and/or are scripting for a multi-locale environment, it's probably best to make sure you know what your locale variables are set to when you're matching files, or to be sure that you're coding in a completely generic way.
It's very difficult to foresee some situations though, unless you've studied linguistics.
However, I don't know of a Latin-using locale that changes the order of letters, so [a-z] would work. There are extensions to the Latin alphabet that collate ligatures and diacritics differently, though. Here's a little experiment:
mkdir /tmp/test
cd /tmp/test
export LC_CTYPE=de_DE.UTF-8
export LC_COLLATE=de_DE.UTF-8
touch Grüßen
ls G* # This says ‘Grüßen’
ls *[a-z]en # This says nothing!
ls *[a-zß]en # This says ‘Grüßen’
ls Gr[a-z]*en # This says nothing!
This is interesting: at least for German, neither diacritics like ü nor ligatures like ß are folded into Latin characters. (Either that, or I messed up the locale change!)
This may be bad for you, of course, if you're trying to find filenames that start with a letter, use [a-z]* to do it, and apply it to a file that starts with ‘Ä’.
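If what you actually want is "starts with an ASCII lowercase letter" no matter what locale the caller has, one option is to pin the locale inside a function. This is only a sketch: the function name is made up, and it relies on bash re-reading LC_ALL when it is assigned as a local variable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if $1 starts with an ASCII lowercase letter.
starts_with_ascii_lower() {
  local LC_ALL=C        # pin the locale for the duration of this function only
  [[ $1 == [a-z]* ]]    # in the C locale the range covers just the bytes a..z
}

for name in abc Abc Äbc; do
  if starts_with_ascii_lower "$name"; then
    printf '%s: starts with a-z\n' "$name"
  else
    printf '%s: does not start with a-z\n' "$name"
  fi
done
```

The caller's locale is untouched after the function returns, so sorting and other locale-aware behaviour elsewhere in the script stay as they were.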
Best Answer
This is a locale problem. In your locale, [A-Z] expands to something like [AbBcC...zZ] (plus probably others like accented characters), therefore [^A-Z] actually means "files that end with a" in your example (and only in your example).
If you want to avoid such a surprise, one way is to set LC_COLLATE=C, since the collation is the part of your locale settings that is responsible for the sorting order. Also, empty LC_ALL if it is set, as it would take precedence.
Or, better, it's probably preferable to not change your locale settings and use the appropriate character classes: [[:lower:]] instead of [a-z] and [[:upper:]] instead of [A-Z].
Or use bash's globasciiranges option (shopt -s globasciiranges).
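As a rough illustration of the last two suggestions, the sketch below turns on globasciiranges (this assumes bash 4.3 or later, where the option exists) and compares a range expression with a character class on a few sample characters. The sample characters and variable names are arbitrary.

```shell
#!/usr/bin/env bash
shopt -s globasciiranges   # ranges in bracket expressions compare as in the C locale

for ch in a B z; do
  if [[ $ch == [a-z] ]]; then r='in [a-z]'; else r='not in [a-z]'; fi
  if [[ $ch == [[:upper:]] ]]; then u='upper'; else u='not upper'; fi
  printf '%s: %s, %s\n' "$ch" "$r" "$u"
done
```

With the option set, B should fall outside [a-z] whatever the collation order of the current locale is, while [[:upper:]] keeps following LC_CTYPE.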