Bash – Command line length limit: built-in vs executable

bash limit posix shell

So per POSIX specification we have the following definition for *:

Expands to the positional parameters, starting from one, initially
producing one field for each positional parameter that is set. When
the expansion occurs in a context where field splitting will be
performed, any empty fields may be discarded and each of the non-empty
fields shall be further split as described in Field Splitting. When
the expansion occurs in a context where field splitting will not be
performed, the initial fields shall be joined to form a single field
with the value of each parameter separated by the first character of
the IFS variable if IFS contains at least one character, or separated
by a <space> if IFS is unset, or with no separation if IFS is set to a
null string.

Most of us are aware of the famous ARG_MAX limitation:

$ getconf ARG_MAX
2621440

which may lead to:

$ cat * | sort -u > /tmp/bla.txt
-bash: /bin/cat: Argument list too long

Thankfully, the good people behind bash (and other POSIX-like shells) provided us with printf as a built-in, so we can simply do:

printf '%s\0' * | sort -u --files0-from=- > /tmp/bla.txt

And everything is transparent for the user.

Could someone please explain why it is so trivial to bypass the ARG_MAX limitation using a built-in command, and why it is so damn hard to provide a conforming POSIX shell interpreter that would gracefully handle the * special parameter when it is passed to a standalone executable:

$ cat *

Would that break something? I am not asking the bash people to provide cat as a built-in; I am solely interested in the order of operations and why * is expanded differently depending on whether the command is a built-in or a standalone executable.

Best Answer

The limitation is not in the shell but in the exec() family of functions.

The POSIX standard says in relation to this:

The number of bytes available for the new process' combined argument and environment lists is {ARG_MAX}. It is implementation-defined whether null terminators, pointers, and/or any alignment bytes are included in this total.

To run utilities that are built into the shell, the shell will not need to call exec(), so it is unaffected by this limitation.
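A small sketch of that difference, assuming bash (or any shell where printf is a built-in) and a system that provides getconf, head -c and env. The variable names are only illustrative. It builds one argument as large as the whole ARG_MAX budget, then hands it first to the built-in printf and then to an external program:

```shell
# Assumption: printf is a built-in in the shell running this (true for bash).
limit=$(getconf ARG_MAX)

# One argument as large as the entire exec() budget.
big=$(head -c "$limit" /dev/zero | tr '\0' 'x')

# Built-in: handled inside the shell process, no exec() involved.
builtin_bytes=$(printf '%s' "$big" | wc -c | tr -d ' ')

# External: "env" has to be exec()ed with the same argument. Together with
# the command name and the environment this exceeds the limit, so the call
# is refused ("Argument list too long").
if env printf '%s' "$big" >/dev/null 2>&1; then
    external=ok
else
    external=failed
fi

echo "builtin wrote $builtin_bytes bytes, external call $external"
```

The glob or variable expansion itself succeeds in both cases; only the second command fails, at the point where the shell asks the kernel to start a new program.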

Notice, too, that it's not simply the length of the command line that is limited, but the combination of the length of the command, its arguments, and the current environment variables and their values.
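To get a rough feeling for how much of the budget the environment already uses, something like the following can help. It is only an estimate: as the quoted text says, whether terminators, pointers and alignment bytes count is implementation-defined, and the variable names here are made up for the example.

```shell
# Total budget for arguments plus environment.
limit=$(getconf ARG_MAX)

# Bytes taken by the exported environment. env prints one NAME=value entry
# per line, so each newline stands in for the entry's null terminator.
env_bytes=$(env | wc -c | tr -d ' ')

# Approximate room left for the command name and its arguments.
approx_free=$((limit - env_bytes))

echo "ARG_MAX=$limit environment=$env_bytes approx_free=$approx_free"
```

Exporting a large variable makes approx_free shrink, which is why the same command line can work in one session and fail in another.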

Also notice that printf is not a built-in utility in e.g. pdksh (which happens to act as sh and ksh on OpenBSD). Relying on it being a built-in means taking into account the specific shell being used.
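One way to check this at run time is to ask the shell itself. The helper below (is_builtin is a hypothetical name, not a standard utility) is a heuristic sketch: POSIX leaves the output format of type unspecified, so it simply assumes that the word "builtin" appears in the description, as it does in bash, dash and ksh93.

```shell
# is_builtin NAME: succeed if the current shell describes NAME as a builtin.
# Heuristic: depends on the wording printed by "type", which is unspecified.
is_builtin() {
    case $(type "$1" 2>/dev/null) in
        *builtin*) return 0 ;;
        *)         return 1 ;;
    esac
}

if is_builtin printf; then
    echo "printf is a builtin here: large argument lists never reach exec()"
else
    echo "printf is external here: it is subject to ARG_MAX like cat"
fi
```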