Shell – Understand the Order Between Expansions

posix shell

From POSIX 7:

The order of word expansion shall be as follows:

1. Tilde expansion (see Section 2.6.1), parameter expansion (see Section 2.6.2), command substitution (see Section 2.6.3), and arithmetic expansion (see Section 2.6.4) shall be performed, beginning to end. See item 5 in Section 2.3.

2. Field splitting (see Section 2.6.5) shall be performed on the portions of the fields generated by step 1, unless IFS is null.

3. Pathname expansion (see Section 2.6.6) shall be performed, unless set -f is in effect.

4. Quote removal (see Section 2.6.7) shall always be performed last.

Are tilde expansion, parameter expansion, command substitution, and arithmetic expansion performed in the order listed?

Does the order between them matter? If so, why is the order as specified?
Why does pathname expansion happen after field splitting, while the other expansions happen before it?

In particular, both tilde expansion and pathname expansion are about pathnames and filenames, so why are they placed differently with respect to field splitting?
Is there no brace expansion in POSIX?
I notice "word expansion". Do expansions apply only to tokens with token identifier WORD, and not to tokens with other token identifiers (e.g. NAME, specific operator, NEWLINE, IO_NUMBER, ASSIGNMENT)?

Best Answer

Tilde expansion, parameter expansion, command substitution and arithmetic expansion are listed in the same step. That means that they are performed at the same time. The result of tilde expansion does not undergo parameter expansion, the result of parameter expansion does not undergo tilde expansion, and so on. For example, if the value of foo is $(bar) qux, then the word $foo expands to $(bar) qux at step 1; the text resulting from parameter expansion is not subject to any further transformation at step 1, but it then gets split by step 2.
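The foo example above can be tried directly. This is a minimal sketch assuming the default IFS; the variable name fields is only illustrative, and bar is a command name that is never run.

```sh
# The value of foo merely looks like a command substitution.
foo='$(bar) qux'

# Unquoted $foo is parameter-expanded once; the resulting text is not expanded
# again, only field-split. printf repeats its format for each argument, so
# every resulting field is wrapped in <>.
fields=$(printf '<%s>' $foo)
echo "$fields"   # prints <$(bar)><qux>
```

The two bracketed fields show that the text was split at the space but that $(bar) was left as literal text.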

“Beginning to end” means left-to-right processing, which matters e.g. when assignments occur: a=1; echo $a$((a=2))$a prints 122, because arithmetic expansion of $((a=2)) is performed, setting a to 2, between the parameter expansion of the first $a and the parameter expansion of the second $a.
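The same left-to-right behaviour can be captured in a variable rather than printed. This is a small sketch; the name result is only illustrative.

```sh
a=1
# Expansions run left to right: the first $a yields 1, then $((a=2))
# assigns 2 to a and yields 2, so the last $a already sees the new value.
result=$a$((a=2))$a
echo "$result"   # 122, as described above
```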

The reason for the order is historical usage. POSIX usually follows existing implementations; it rarely specifies new behavior. There are multiple shells around; for the most part, POSIX follows the Korn shell but omits most features that are not present in the Bourne shell (as the Bourne shell is largely abandoned, the next version of POSIX is likely to include new ksh features, though).

The reason why the Bourne shell performed parameter expansion then field splitting then globbing is that it allowed a glob to be stored in a variable: you can set a to *.txt *.pdf and then use $a to stand for the list of names of files matching *.txt followed by the list of names matching *.pdf (assuming both patterns match). (I'm not saying this is the best design possible, just that it was designed this way.) It's less clear to me why one would want command substitution to be placed at a particular step in the Bourne shell; in the Korn shell, its syntax $(…) is close to parameter expansion ${…} so it makes sense to perform them together.
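Here is a sketch of the glob-in-a-variable idiom. It assumes mktemp -d is available (it is common but not required by POSIX) and uses made-up file names in a scratch directory.

```sh
# Create a scratch directory with three illustrative files.
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt" "$dir/c.pdf"

a='*.txt *.pdf'

# Unquoted: $a is expanded, split into two patterns, then each pattern is globbed.
matches=$(cd "$dir" && echo $a)
# Quoted: no field splitting and no pathname expansion, so the text stays literal.
literal=$(cd "$dir" && echo "$a")

echo "$matches"   # a.txt b.txt c.pdf
echo "$literal"   # *.txt *.pdf

rm -rf "$dir"
```

The subshells keep the change of directory from leaking into the rest of the script.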

The placement of tilde expansion is a historical oddity. It would have made more sense to place it later, so that you could write ~$some_user and have it expand to the home directory of the user whose name is the value of the variable some_user. I don't know why it wasn't done this way. This order even requires a special statement that the result of tilde expansion does not undergo other expansions (going by the passage you quoted, if HOME is /foo bar then ~ would expand to the two words /foo and bar due to field splitting, but no shell does that and POSIX.2008 explicitly states that “the pathname resulting from tilde expansion shall be treated as if quoted”).
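The "treated as if quoted" rule can be observed by giving HOME a value that contains a space. This is a sketch; the names from_tilde and from_param are only illustrative, and the default IFS is assumed.

```sh
# Each command substitution is a subshell, so the caller's HOME is untouched.

# Tilde expansion: the result is treated as if quoted, so it stays one field.
from_tilde=$(HOME='/foo bar'; printf '<%s>' ~)
# Unquoted parameter expansion of the same value is field-split.
from_param=$(HOME='/foo bar'; printf '<%s>' $HOME)

echo "$from_tilde"   # </foo bar>
echo "$from_param"   # </foo><bar>
```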

There is no brace expansion in POSIX, otherwise the specification would state it.

Word expansion is only performed on WORDs, and with caveats mentioned in the following sections (e.g. field splitting and pathname generation are only performed in contexts that allow multiple words, not e.g. between double quotes). NAMEs, NEWLINEs, IO_NUMBERs and so on don't contain anything that could be expanded anyway.

Related Solutions

Word Splitting in Shell – What is Word Splitting and Its Importance in Shell Programming

Early shells had only a single data type: strings. But it is common to manipulate lists of strings, typically when passing multiple file names as arguments to a program. Another common use case for splitting is when a command outputs a list of results: the command's output is a string, but the desired data is a list of strings. To store a list of file names in a variable, you would put spaces between them. Then a shell script like this

files="foo bar qux"
myprogram $files

called myprogram with three arguments, as the shell split the string $files into words. At the time, spaces in file names were either forbidden or widely considered Not Done.

The Korn shell introduced arrays: you could store a list of strings in a variable. The Korn shell remained compatible with the then-established Bourne shell, so bare variable expansions kept undergoing word splitting, and using arrays required some syntactic overhead. You would write the snippet above

files=(foo bar qux)
myprogram "${files[@]}"

Zsh had arrays from the start, and its author opted for a saner language design at the expense of backward compatibility. In zsh (under the default expansion rules) $var does not perform word splitting; if you want to store a list of words in a variable, you are meant to use an array; and if you really want word splitting, you can write $=var.

files=(foo bar qux)
myprogram $files

These days, spaces in file names are something you need to cope with, both because many users expect them to work and because many scripts are executed in security-sensitive contexts where an attacker may be in control of file names. So automatic word splitting is often a nuisance; hence my general advice to always use double quotes, i.e. write "$foo", unless you understand why you need word splitting in a particular use case. (Note that bare variable expansions undergo globbing as well.)
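To illustrate the difference that the double quotes make, here is a sketch with a hypothetical helper count_args and a made-up file name, assuming the default IFS.

```sh
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

foo='my file.txt'

unquoted=$(count_args $foo)    # split at the space: two arguments
quoted=$(count_args "$foo")    # kept intact: one argument

echo "unquoted=$unquoted quoted=$quoted"   # unquoted=2 quoted=1
```

A program handed the unquoted form would receive my and file.txt as two separate file names.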

POSIX Compliance – Is Linux ARG_MAX Different from Other System Variables?

There is no standard way to retrieve the list of configuration variables that are supported on a system. If you program for a given POSIX version, the list in that version of the POSIX specification is your reference list. On Linux, getconf -a lists all available variables.

fpathconf isn't specific to PATH. It's about variables that are related to files, which are the ones that may vary from file to file.
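From the shell, the getconf utility exposes both kinds of variable. The sketch below queries one system-wide variable and one per-file variable; the choice of NAME_MAX and of the path / is only an example, and the values printed depend on the system.

```sh
# System-wide variable: no pathname argument.
arg_max=$(getconf ARG_MAX)

# File-related variable: takes a pathname, because the value may differ
# from one file system to another.
name_max=$(getconf NAME_MAX /)

printf 'ARG_MAX=%s NAME_MAX(/)=%s\n' "$arg_max" "$name_max"
```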

Regarding ARG_MAX on Linux, the rationale for depending on the stack size is that the arguments end up on the stack, so there had better be enough room for them plus everything else that must fit. Most other implementations (including older versions of Linux) have a fixed size.

Most limits go together with resource availability, with different resources depending on the limit. For example, a process may be unable to open a file even if it has fewer than OPEN_MAX files open, if the system is out of memory that can be used for the file-related data.

Linux is POSIX-compliant on this point by default, so I don't know what you're getting at.

If you use ulimit -s to restrict the stack size to less than ARG_MAX, you're making the system no longer compliant. A POSIX system can typically be made non-compliant in any number of ways, including PATH=/nowhere (making all standard utilities unavailable) or rm -rf /.

The value of ARG_MAX in limits.h provides a minimum that applications can rely on. A POSIX-compliant system is allowed to let execve succeed even if the arguments exceed that size. The guarantee related to ARG_MAX is that if the arguments fit in that size then execve will not fail due to E2BIG.
