Bash – Why is bash extended-globbing variable substitution acting at the byte level

bash, locale, variable substitution, wildcards

I thought that bash variable substitution and globbing worked at character resolution, so I was rather surprised to see it acting at the byte level.
Everything in my locale is en_AU.UTF-8

When there is nothing to match and the pattern allows zero-to-many, the replacement occurs at the byte level, as seen by subsequent replacements. I would have expected it to move along to the next character, but it doesn't…

Maybe this is just a whacky fringe case pattern, or I'm missing something obvious, but I do wonder what is going on here, and can I expect this behaviour elsewhere besides this particular pattern?

Here is the script (which started as an attempt to split a string into characters).
I expected that the last test, with the character ळ, would end up with only a single space preceding the ळ, but instead each of the character's 3 UTF-8 bytes is preceded by a space. This results in invalid UTF-8 output.

shopt -s extglob
for str in  $'\t' "ab"  ळ ;do
    printf -- '%s' "${str//*($'\x01')/ }" |xxd
done

Output:

0000000: 2009                                      .
0000000: 2061 2062                                 a b
0000000: 20e0 20a4 20b3                            . . .

Best Answer

The short answer to your question is that *(pattern-list) matches zero or more occurrences of the given patterns, so it also matches the empty string. There are zero instances of Unicode character 0001 in front of each of the input bytes, and the replace operation replaces each of those empty matches with a single space.
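The empty-match effect does not depend on multibyte input. The following is a minimal sketch using plain ASCII; the variable names and the literal x in the pattern are illustrative assumptions, not taken from the question.

```shell
#!/usr/bin/env bash
shopt -s extglob    # enable the *(...) and +(...) pattern operators

str="ab"

# *(x) also matches the empty string, so an empty match is found
# (and replaced with a space) in front of each character.
zero_or_more="${str//*(x)/ }"

# +(x) needs at least one x, so nothing in "ab" matches.
one_or_more="${str//+(x)/ }"

printf '[%s]\n' "$zero_or_more" "$one_or_more"
```

This is the same shape as the "ab" line in the question's output (20 61 20 62), which suggests the surprise is in how far bash advances after an empty match, not in the pattern itself.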

Maybe you meant to do this:

$ for str in  $'\t' "ab"  ळ ; do
    printf -- '%s' "${str//+($'\x01')/ }" |xxd
  done
0000000: 09                                       .
0000000: 6162                                     ab
0000000: e0a4 b3                                  ...
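
The question mentions that the script started as an attempt to split a string into characters. One way to do that without pattern substitution is substring expansion, which bash indexes by character in a multibyte locale. This is only a sketch: split_chars is a hypothetical helper name, and the per-character behaviour for non-ASCII input assumes a working UTF-8 locale.

```shell
#!/usr/bin/env bash
# split_chars is a hypothetical helper, not part of the original post.
# It prints each character of its first argument on its own line.
split_chars() {
    local str=$1 i
    # ${#str} is the length in characters under the current locale;
    # ${str:i:1} extracts the single character at offset i.
    for (( i = 0; i < ${#str}; i++ )); do
        printf '%s\n' "${str:i:1}"
    done
}

split_chars "ab"
```

Under en_AU.UTF-8, calling split_chars ळ should print the character whole, since the loop never looks at individual bytes.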

But the longer answer is that in any case, pathnames aren't text. At least, they're not as far as the (Unix-like) operating system is concerned. They are byte sequences. The problem is that things like this are trivial to do:

$ LC_ALL=latin1
$ mkdir 'áñ' && cd 'áñ'
$ LC_ALL=ga_IE.iso885915@euro
$ mkdir '€25' && cd '€25'
$ LC_ALL=zh_TW
$ pwd
# ... what should the output be?  And what about the output of:
$ /bin/pwd

Each of those locales includes characters which don't exist in the others. This problem affects things like locate -r and find -regex too: the argument of locate -r is a regular expression, which must therefore support things like character classes. But you don't know which locale to use to determine the character classes for the characters in the path names, or even whether there is a single usable locale that can represent all the paths on the system.
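
To see how the same bytes are counted differently depending on the locale, here is a small sketch. byte_len is a hypothetical helper name; it assumes only that the C locale is available, which is normally the case.

```shell
#!/usr/bin/env bash
# byte_len is a hypothetical helper, not part of the original answer.
# It prints the length of its argument in bytes by switching to the
# C locale inside a subshell, so the caller's locale is left untouched.
byte_len() {
    ( LC_ALL=C; printf '%s\n' "${#1}" )
}

name=$'\xe0\xa4\xb3'       # the three UTF-8 bytes of ळ from the question

byte_len "$name"           # prints 3
printf '%s\n' "${#name}"   # length in characters under the current locale
```

The second number depends on whichever locale the shell is running in, which is the point of the answer: the bytes in a path name are fixed, but what counts as a character is decided by the reader's locale.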
