Bash – Why not use backticks with for loop

bash, shell-script

Some time ago, I posted an answer to a question about scripting. Someone pointed out that I shouldn't use the following command:

for x in $(cat file); do something; done 

but should instead use:

while read f; do something; done < file

The Useless Use of Cat article is supposed to explain the whole problem, but the only explanation it gives is:

The backticks are outright dangerous, unless you know the result of
the backticks is going to be less than or equal to how long a command
line your shell can accept. (Actually, this is a kernel limitation.
The constant ARG_MAX in your limits.h should tell you how much your
own system can take. POSIX requires ARG_MAX to be at least 4,096
bytes.)

If I understood this correctly, bash(?) should crash if I use the output of a very big file in a command (the expansion would exceed the ARG_MAX defined in limits.h). So I checked ARG_MAX with this command:

> grep ARG_MAX /usr/src/kernels/$(uname -r)/include/uapi/linux/limits.h
#define ARG_MAX       131072    /* # bytes of args + environ for exec() */
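
As an aside, there is no need to dig through kernel sources for this: the POSIX getconf utility should report the same limit at runtime (on this system it should match the 131072 from the header):

> getconf ARG_MAX
131072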

Then I created a file containing text with no spaces:

> ls -l
-rw-r--r--. 1 root root 100000000 Aug 21 15:37 in_file

Then I ran:

for i in $(cat in_file); do echo $i; done

aaaand nothing terrible happened.

So what should I do to check if/how this whole 'don't use cat with loop' thing is dangerous?

Best Answer

It depends on what file is meant to contain. If it's meant to contain an IFS-separated list of shell globs like (assuming the default value of $IFS):

/var/log/*.log /var/adm/*~
/some/dir/*.txt

then for i in $(cat file) would be the way to go, as that's exactly what an unquoted $(cat file) does: it applies the split+glob operator to the output of cat file, stripped of its trailing newline characters. The loop would then iterate over each filename resulting from the expansion of those globs (except where a glob matches no file, in which case the glob is left in place, unexpanded).
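
You can watch that split+glob behaviour directly (globlist is just a scratch file name for this sketch):

printf '%s\n' '/etc/*.conf /var/log/*.log' > globlist
for i in $(cat globlist); do printf '%s\n' "$i"; done
# prints one matching filename per line (whatever those globs
# happen to match on your system); a glob with no match is
# printed as-is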

If you wanted to loop over each line of file, you'd do:

while IFS= read -r line <&3; do   # IFS= and -r: take the line verbatim
{
  something with "$line"
} 3<&-                            # close fd 3 for "something"
done 3< file                      # read via fd 3, leaving stdin free
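
Reading via fd 3 matters as soon as something itself consumes standard input. A classic illustration of the failure mode (ssh and somehost are placeholders here, standing in for any stdin-reading command):

while IFS= read -r line; do
  ssh somehost "something with $line"  # ssh drains stdin, swallowing
done < file                            # the remaining lines of file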

With a for loop, you could loop over every non-empty line with:

IFS='
' # split on newlines only (actually on sequences of newlines,
  # ignoring leading and trailing ones, as newline is an
  # IFS whitespace character)
set -o noglob # disable the glob part of the split+glob operator:
for line in $(cat file); do
   something with "$line"
done
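
Since changing IFS and turning off globbing would affect the rest of the script, one option (a sketch, assuming nothing else in the loop needs the old settings) is to confine both to a subshell:

(
  IFS='
'               # split on newlines only
  set -o noglob
  for line in $(cat file); do
    something with "$line"
  done
)               # original IFS and glob settings restored on exit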

However, a:

while read line; do
  something with "$line"
done < file

makes little sense. That's reading the content of file in a very convoluted way, in which characters of $IFS and backslashes are treated specially.
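
To see that special treatment concretely (demo is just a scratch file name):

printf '%s\n' '  a\b  c  ' > demo
while read line; do printf '[%s]\n' "$line"; done < demo
# prints [ab  c]: surrounding blanks stripped, backslash removed
while IFS= read -r line; do printf '[%s]\n' "$line"; done < demo
# prints [  a\b  c  ]: the line, verbatim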

In any case, the ARG_MAX limit the quoted text refers to applies to the execve() system call (to the cumulative size of the arguments and environment variables), so it is only relevant when a command stored on the filesystem is executed with the possibly very long expansion of the split+glob operator applied to the command substitution (that quoted text is misleading and wrong on several accounts).

It would apply for instance in:

cat -- $(cat file) # with shell implementations where cat is not builtin

But not in:

for i in $(cat file)

where there's no execve() system call involved.

Compare:

bash-4.4$ echo '/*/*/*/*' > file
bash-4.4$ true $(cat file)
bash-4.4$ n=0; for f in $(cat file); do ((n++)); done; echo "$n"
523696
bash-4.4$ /bin/true $(cat file)
bash: /bin/true: Argument list too long

It's OK with bash's true builtin command or the for loop, but not when executing /bin/true. Note how the file is only 9 bytes large, yet the expansion of $(cat file) is several megabytes, because the /*/*/*/* glob is expanded by the shell.
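
If you genuinely need to hand such a huge expansion to an external command, one workaround (a sketch; it relies on printf being a shell builtin, as it is in bash, and on an xargs that supports -0, as GNU and BSD ones do) is to let xargs batch the execve() calls:

printf '%s\0' $(cat file) | xargs -0 /bin/true
# the expansion happens inside the shell (no execve() involved);
# xargs then runs /bin/true repeatedly, each time with an
# argument list that fits within ARG_MAX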
