Bash – Why is bash extended-globbing variable substitution acting at the byte level?


I thought that bash variable substitution and globbing worked at character resolution, so I was rather surprised to see them acting at the byte level.
Everything in my locale is en_AU.UTF-8.

When there is nothing to match and the pattern allows zero-to-many, the replacement occurs at the byte level, as seen by subsequent replacements. I would have expected it to move along to the next character, but it doesn't…

Maybe this is just a wacky fringe-case pattern, or I'm missing something obvious, but I do wonder what is going on here. Can I expect this behaviour elsewhere besides this particular pattern?

Here is the script (which started as an attempt to split a string into characters).
I expected that the last test, with character ळ, would end up with only a single space preceding the ळ, but instead, the character's 3 UTF-8 bytes are each preceded by a space. This results in invalid UTF-8 output.

shopt -s extglob
for str in  $'\t' "ab"  ळ ;do
    printf -- '%s' "${str//*($'\x01')/ }" |xxd
done

Output:

0000000: 2009                                      .
0000000: 2061 2062                                 a b
0000000: 20e0 20a4 20b3                            . . .

Best Answer

The short answer to your question is that *(pattern-list) matches zero or more occurrences of the given patterns. There are zero occurrences of the character U+0001 in front of each of the input bytes, and that empty match is what gets replaced, so the replace operation puts a single space before every byte.

Maybe you meant to do this:

$ for str in  $'\t' "ab"  ळ ; do
    printf -- '%s' "${str//+($'\x01')/ }" |xxd
  done
0000000: 09                                       .
0000000: 6162                                     ab
0000000: e0a4 b3                                  ...
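
To see the difference between the two operators without any multibyte characters involved, here is a minimal sketch on a plain ASCII string. The variable names and the x pattern are only illustrative; x never occurs in the input, so *(x) can only produce zero-length matches while +(x) produces none.

```bash
shopt -s extglob

str=ab

# *(x) also matches the empty string, so there is a zero-length match
# in front of each character, and each one is replaced by a space.
zero_or_more=${str//*(x)/ }

# +(x) needs at least one x, so nothing matches and str is unchanged.
one_or_more=${str//+(x)/ }

printf '[%s]\n' "$zero_or_more" "$one_or_more"
# prints "[ a b]" then "[ab]"
```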

But the longer answer is that in any case, pathnames aren't text. At least, they're not as far as the (Unix-like) operating system is concerned. They are byte sequences. The problem is that things like this are trivial to do:

$ LC_ALL=latin1
$ mkdir 'áñ' && cd 'áñ'
$ LC_ALL=ga_IE.iso885915@euro
$ mkdir '€25' && cd '€25'
$ LC_ALL=zh_TW
$ pwd
# ... what should the output be?  And what about the output of:
$ /bin/pwd

Each of those locales includes characters which don't exist in the others. This problem affects things like locate -r and find -regex too. The argument of locate -r is a regular expression, so it must support things like character classes, but you don't know which locale to use to determine the character classes for the characters in the path names, or even whether there is a single usable locale that can represent all the paths on the system.
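
As a rough illustration of the "byte sequences, not text" point, the sketch below measures a name under the C locale, where every byte counts as one character. The name variable is hypothetical; its two bytes are the UTF-8 encoding of á.

```bash
# Two bytes: the UTF-8 encoding of "á" (illustrative value).
name=$'\xc3\xa1'

# Switch to the C locale inside a subshell only, then ask for the length.
# In the C locale bash counts bytes, not characters.
byte_len=$(LC_ALL=C; printf '%s' "${#name}")

printf '%s\n' "$byte_len"   # prints 2
```

Under a UTF-8 locale the same ${#name} would report 1, provided such a locale is installed, which is exactly why the locale in effect matters when a tool tries to treat a path as text.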

Related Solutions

Bash – Variable Substitution with Exclamation Mark

That is an indirect expansion, documented in man bash section EXPANSION, subsection Parameter Expansion:

If the first character of parameter is an exclamation point (!), a level of variable indirection is introduced. Bash uses the value of the variable formed from the rest of parameter as the name of the variable; this variable is then expanded and that value is used in the rest of the substitution, rather than the value of parameter itself. This is known as indirect expansion.

bash-4.2$ DDF_SOURCE="siebel_DATA_DATE_FORMAT"

bash-4.2$ siebel_DATA_DATE_FORMAT='Hello Indirect Redirection'

bash-4.2$ DATA_DATE_FORMAT=${!DDF_SOURCE} # siebel_DATA_DATE_FORMAT must get value before this line

bash-4.2$ echo $DATA_DATE_FORMAT
Hello Indirect Redirection
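
A typical use is to pick one of several similarly named variables at run time. The following is only a sketch: oracle_DATA_DATE_FORMAT, source_system, var_name and the format strings are made-up names and values for illustration.

```bash
# Two per-source settings (hypothetical names and values).
siebel_DATA_DATE_FORMAT='%Y-%m-%d'
oracle_DATA_DATE_FORMAT='%d/%m/%Y'

# Decide at run time which one to read.
source_system=oracle
var_name="${source_system}_DATA_DATE_FORMAT"

# ${!var_name} expands to the value of the variable whose name is
# stored in var_name, i.e. oracle_DATA_DATE_FORMAT here.
date_format=${!var_name}

printf '%s\n' "$date_format"   # prints %d/%m/%Y
```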

Bash – Guide to Extended Globbing

Bash has no feature to expand just one match out of many.

The pattern @(foo) matches just one occurrence of the pattern foo. That is, it matches foo, but not foofoo. This syntactic form is useful for building alternations ("or" patterns) such as @(foo|bar), which matches either foo or bar. It can be used as part of longer patterns like @(foo|bar)-*.txt, which matches foo-hello.txt, foo-42.txt, bar-42.txt, etc.
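
The pattern can be tried against plain strings with [[ ... == ... ]], without touching any files. This is a small sketch; the list of names and the matched array are made up for the example.

```bash
shopt -s extglob

matched=()
for name in foo-hello.txt foo-42.txt bar-42.txt foofoo-1.txt baz-42.txt; do
  # The right-hand side of == is treated as a pattern because it is unquoted.
  if [[ $name == @(foo|bar)-*.txt ]]; then
    matched+=("$name")
  fi
done

printf '%s\n' "${matched[@]}"
# prints foo-hello.txt, foo-42.txt and bar-42.txt, one per line
```

foofoo-1.txt is rejected because @(foo|bar) consumes exactly one foo and the next character must then be a hyphen.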

If you want to use one match among many, you can put the matches in an array, and then use an element of the array.

kernels=(vmlinuz*)
ls -l "${kernels[0]}"

Matches are always sorted in lexicographic order (by the current locale's collation rules), so this will list the first match in that order.

Note that if the pattern doesn't match any file, you'll get an array containing a single element which is the unchanged pattern:

$ a=(doesnotmatchanything*)
$ ls -l "${a[0]}"
ls: cannot access doesnotmatchanything*: No such file or directory

Set the nullglob option to get an empty array instead.

shopt -s nullglob
kernels=(vmlinuz*)
if ((${#kernels[@]} == 0)); then
  echo "No kernels here"
else
  echo "One of the ${#kernels[@]} kernels is ${kernels[0]}"
fi
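
The effect of nullglob can be checked side by side. This sketch assumes the default settings otherwise (in particular that failglob is off) and uses a throwaway directory from mktemp so that the pattern is guaranteed not to match; tmpdir, without and with are illustrative names.

```bash
# An empty scratch directory, so the pattern below cannot match anything.
tmpdir=$(mktemp -d)

shopt -u nullglob
without=("$tmpdir"/doesnotmatchanything*)   # the pattern is kept literally

shopt -s nullglob
with=("$tmpdir"/doesnotmatchanything*)      # the pattern expands to nothing

printf '%s %s\n' "${#without[@]}" "${#with[@]}"   # prints "1 0"

rmdir "$tmpdir"
```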

Zsh has convenient features here. The glob qualifier [NUM] causes the pattern to expand to only the NUMth match; the variant [NUM1,NUM2] expands to the NUM1th through NUM2th matches (starting at 1).

% ls -l vmlinuz*([1])
lrwxrwxrwx 1 root root 26 Nov 15 21:12 vmlinuz -> vmlinuz-3.16-0.bpo.3-amd64
% ls -l nosuchfilehere*([1])
zsh: no matches found: nosuchfilehere*([1])

The glob qualifier N causes the pattern to expand to an empty list if no file is matched.

kernels=(vmlinuz*(N))
if ((#kernels)); then
  ls -l $kernels
else
  echo "No kernels here"
fi

The glob qualifier om sorts matches by increasing age instead of by name (m is for modification time); Om sorts by decreasing age. So vmlinuz*(om[1]) expands to the most recent kernel file.
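
Bash has no glob qualifiers, but a rough equivalent of om[1] can be sketched with a loop and the -nt file test. This is only a sketch: the temporary directory and the two files vmlinuz-old and vmlinuz-new are made up so that the example is self-contained.

```bash
# Scratch directory with two files whose modification times differ.
tmpdir=$(mktemp -d)
touch -t 202001010000 "$tmpdir/vmlinuz-old"
touch -t 202101010000 "$tmpdir/vmlinuz-new"

shopt -s nullglob   # an unmatched pattern gives an empty loop
newest=
for f in "$tmpdir"/vmlinuz*; do
  # Keep f if it is the first match or newer than the current candidate.
  if [[ -z $newest || $f -nt $newest ]]; then
    newest=$f
  fi
done

printf '%s\n' "${newest##*/}"   # prints vmlinuz-new

rm -r -- "$tmpdir"
```

If nothing matches, newest stays empty, so the caller should check for that before using it.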
