Bash – the data structure of $@ in shell

bash shell

We usually use $@ to represent all of the arguments except $0. However, I don't know what data structure $@ is.

Why does it behave differently from $* when enclosed in double quotes? Could anyone give me an interpreter-level explanation?

It can be iterated over in a for loop, so it seems to be an array.
However, it can also be echoed in its entirety with a simple echo $@; if it were an array, only the first element would be shown. Due to the limitations of the shell, I cannot write more experimental code to test this.

Difference from this post: that post shows how $@ behaves differently from $*. But I am wondering about the data type of $@. The shell, as an interpreted language like Python, should represent data according to a set of fundamental types. In other words, I want to know how $@ is stored in computer memory.

Is it a string, a multi-line string or an array?

If it is a unique data type, is it possible to define a custom variable as an instance of this type?

Best Answer

That started as a hack in the Bourne shell. In the Bourne shell, IFS word splitting was done (after tokenisation) on all words in list context (command line arguments or the words the for loops loop on). If you had:

IFS=i var=file2.txt
edit file.txt $var

That second line would be tokenised into 3 words, $var would be expanded, and split+glob would be done on all three words, so you would end up running ed with t, f, le.txt, f, le2.txt as arguments.

Quoting parts of that would prevent the split+glob. The Bourne shell initially remembered which characters were quoted by setting the 8th bit on them internally (that changed later when Unix became 8-bit clean, but the shell still did something similar to remember which bytes were quoted).

Both $* and $@ were the concatenation of the positional parameters with space in-between. But there was a special processing of $@ when inside double-quotes. If $1 contained foo bar and $2 contained baz, "$@" would expand to:

foo bar baz
^^^^^^^ ^^^

(the ^s above indicate which characters have the 8th bit set). The first space was quoted (had the 8th bit set), but the second one (the one added in between words) was not.

And it's the IFS splitting that takes care of separating the arguments (assuming the space character is in $IFS as it is by default). That's similar to how $* was expanded in its predecessor the Mashey shell (itself based on the Thompson shell, while the Bourne shell was written from scratch).
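
The net effect of that special treatment can still be observed in a current bash. The following is a minimal sketch, assuming the default $IFS (space, tab, newline); count_args is a hypothetical helper introduced here for illustration and the variable names are arbitrary.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { printf '%s\n' "$#"; }

set -- "foo bar" baz             # $1 is "foo bar", $2 is "baz"

quoted_at=$(count_args "$@")     # each positional parameter stays one word
quoted_star=$(count_args "$*")   # all parameters joined into a single word
unquoted_at=$(count_args $@)     # unquoted: the results are split again on $IFS

printf 'quoted $@: %s, quoted $*: %s, unquoted $@: %s\n' \
    "$quoted_at" "$quoted_star" "$unquoted_at"
# prints: quoted $@: 2, quoted $*: 1, unquoted $@: 3
```

Only the quoted "$@" hands the two original arguments on unchanged, which is the verbatim pass-through the construct was meant to provide.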

That explains why in the Bourne shell initially "$@" would expand to the empty string instead of nothing at all when the list of positional parameters was empty (you had to work around it with ${1+"$@"}), why it didn't keep the empty positional parameters and why "$@" didn't work when $IFS didn't contain the space character.

The intention was to be able to pass the list of arguments verbatim to another command, but that didn't work properly for the empty list, for empty elements or when $IFS didn't contain space (the first two issues were eventually fixed in later versions).

The Korn shell (on which the POSIX spec is based) changed that behaviour in a few ways:

IFS splitting is only done on the result of unquoted expansions (not on literal words like edit or file.txt in the example above).
$* and $@ are joined with the first character of $IFS, or with space when $IFS is empty, except that for a quoted "$@" that joiner is unquoted like in the Bourne shell, and for a quoted "$*" when IFS is empty the positional parameters are appended without separator.
It added support for arrays, with ${array[@]} and ${array[*]} reminiscent of Bourne's $@ and $*, but starting at index 0 instead of 1 and sparse (more like associative arrays), which means $@ cannot really be treated as a ksh array (compare with csh/rc/zsh/fish/yash where $argv/$* are normal arrays).
The empty elements are preserved.
"$@" when $# is 0 now expands to nothing instead of the empty string, and "$@" works when $IFS doesn't contain spaces, except when IFS is empty. An unquoted $* without wildcards expands to one argument (where the positional parameters are joined with space) when $IFS is empty.

ksh93 fixed the remaining few problems above. In ksh93, $* and $@ expand to the list of positional parameters, separated regardless of the value of $IFS, and then further split+globbed+brace-expanded in list contexts. $* is joined with the first byte (not character) of $IFS. "$@" in list contexts expands to the list of positional parameters, regardless of the value of $IFS. In non-list contexts, like in var=$@, $@ is joined with space regardless of the value of $IFS.
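
Several of the Korn-era changes described above can be checked in a current bash. This is only a sketch of the observable behaviour, not of how any shell implements it; count_args is a hypothetical helper and the variable names are made up for the example.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { printf '%s\n' "$#"; }

# Literal words are no longer split, even though they contain the "i" in $IFS;
# only the result of the unquoted expansion $var is (into "f" and "le2.txt").
var=file2.txt
split_count=$(IFS=i; count_args edit file.txt $var)

set --                              # empty list of positional parameters
empty_count=$(count_args "$@")      # "$@" expands to nothing at all

set -- a "" c                       # an empty element in the middle
kept_count=$(count_args "$@")       # the empty element is preserved

printf '%s %s %s\n' "$split_count" "$empty_count" "$kept_count"
# prints: 4 0 3
```

The IFS assignment is made inside the command substitution, so it only affects that subshell and leaves the rest of the script on the default $IFS.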

bash's arrays are designed after the ksh ones. The differences are:

no brace-expand upon unquoted expansion
first character of $IFS instead of first byte
some corner case differences like the expansion of $* when non-quoted in non-list context when $IFS is empty.

While the POSIX spec used to be pretty vague, it now more or less specifies the bash behaviour.
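
As an illustration of the bash behaviour, here is a short sketch of how the joiner for a quoted "$*" follows $IFS while a quoted "$@" does not depend on it. It assumes a reasonably recent bash; count_args and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { printf '%s\n' "$#"; }

set -- a b c

joined_colon=$(IFS=:; printf '%s' "$*")   # joined with the first character of $IFS
joined_empty=$(IFS=; printf '%s' "$*")    # empty $IFS: appended without separator
at_colon=$(IFS=:; count_args "$@")        # still three separate arguments
at_empty=$(IFS=; count_args "$@")         # still three, even with an empty $IFS

printf '%s %s %s %s\n' "$joined_colon" "$joined_empty" "$at_colon" "$at_empty"
# prints: a:b:c abc 3 3
```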

$@ is different from normal arrays in ksh or bash in that:

Indices start at 1 instead of 0 (except in "${@:0}" which includes $0 (not a positional parameter, and in functions gives you the name of the function or not depending on the shell and how the function was defined)).
You can't assign elements individually
It's not sparse, and you can't unset elements individually
shift can be used.
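
These differences can be sketched in bash as follows. The array name args and the other variable names are hypothetical; copying the list into an ordinary array is shown as one possible way to get element-wise assignment, not as the only one.

```shell
#!/usr/bin/env bash
set -- one two three       # the list can only be replaced as a whole
first=$1                   # positional parameters are numbered from 1
shift                      # drop $1; the remaining parameters move down
after_shift="$#:$1"        # now two parameters, the first being "two"

# Individual elements cannot be assigned, but the list can be copied into
# a regular bash array (numbered from 0), changed there and written back.
args=("$@")
args[0]=TWO
set -- "${args[@]}"
rewritten="$1 $2"

printf '%s | %s | %s\n' "$first" "$after_shift" "$rewritten"
# prints: one | 2:two | TWO three
```

A user-defined variable cannot be given the exact type of $@, but a regular array expanded as "${args[@]}" gets the same one-word-per-element treatment.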

In zsh or yash where arrays are normal arrays (not sparse, indices start at one like in all other shells but ksh/bash), $* is treated as a normal array. zsh has $argv as an alias for it (for compatibility with csh). $* is the same as $argv or ${argv[*]} (arguments joined with the first character of $IFS but still separated out in list contexts). "$@", like "${argv[@]}" or "${*[@]}", undergoes the Korn-style special processing.

Related Solutions

Bash – the name of the shell feature `>(tee copyError.txt >&2)`

From man bash:

   Process Substitution
       Process substitution is supported  on  systems  that  support
       named  pipes  (FIFOs)  or  the  /dev/fd method of naming open
       files.  It takes the form of <(list) or >(list).  The process
       list  is  run with its input or output connected to a FIFO or
       some file in /dev/fd.  The name of this file is passed as  an
       argument  to  the current command as the result of the expan‐
       sion.  If the >(list) form is used, writing to the file  will
       provide  input  for  list.   If the <(list) form is used, the
       file passed as an argument should be read to obtain the  out‐
       put of list.

You can search manpages by pressing / and then typing your search string, which is a good way of finding information like this. It does of course require that you know in which manpage to search :)

You have to quote the ( though, because it has a special meaning when searching. To find the relevant section in the bash manpage, type />\(.
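
To make the man page excerpt more concrete, here is a small sketch of the <(list) form, assuming bash on a system with FIFOs or /dev/fd as the excerpt requires. The variable names are made up for the example.

```shell
#!/usr/bin/env bash
# <(list) expands to a file name; reading that file yields the output of list.
# Here the file is used as the source of an ordinary input redirection.
read -r first_line < <(printf 'hello\nworld\n')

# The file name can also be passed as a plain argument to any command.
merged=$(cat <(printf 'left\n') <(printf 'right\n'))

printf '%s\n' "$first_line"    # prints: hello
printf '%s\n' "$merged"        # prints "left" and "right" on separate lines
```

The >(list) form from the title works the other way round: whatever the command writes to the substituted file name becomes the input of list.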

Why Not to Parse ls Command and What to Use Instead

I am not at all convinced of this, but let's suppose for the sake of argument that you could, if you're prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary" — someone who knows the code you wrote and is deliberately choosing filenames designed to break it.

Even if you could do that, it would still be a bad idea.

Bourne shell is not a good language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).

I claim that if you're faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that's a strong indication that whatever you are doing is too complicated for shell and you should rewrite the entire thing in Perl or Python. Here's your last program in Python:

import os, sys

# Walk the tree below the current directory; print inode, directory and name.
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        ino = os.lstat(os.path.join(subdir, f)).st_ino
        sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

This has no issues whatsoever with unusual characters in filenames -- the output is ambiguous in the same way the output of ls is ambiguous, but that wouldn't matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.

Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:

import os, sys

filelist = []
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        # Skip dotfiles and editor backups.
        if f[0] == '.' or f[-1] == '~':
            continue
        lstat = os.lstat(os.path.join(subdir, f))
        filelist.append((f, subdir, lstat.st_ino))

# Sort by basename, which is the first element of each tuple.
filelist.sort(key=lambda x: x[0])
for f, subdir, ino in filelist:
    sys.stdout.write("%d %s %s\n" % (ino, subdir, f))
