Shell Loop – Why It’s Bad Practice for Text Processing

shell, text-processing

Is using a while loop to process text generally considered bad practice in POSIX shells?

As Stéphane Chazelas pointed out, the reasons for not using a shell loop fall under several headings: conceptual, reliability, legibility, performance and security.

This answer explains the reliability and legibility aspects:

while IFS= read -r line <&3; do
  printf '%s\n' "$line"
done 3< "$InputFile"

For performance, the while loop with read is tremendously slow when reading from a file or a pipe, because the read shell built-in reads one character at a time.

How about conceptual and security aspects?

Best Answer

Yes, we see a number of things like:

while read line; do
  echo $line | cut -c3
done

Or worse:

for line in `cat file`; do
  foo=`echo $line | awk '{print $2}'`
  echo whatever $foo
done

(don't laugh, I've seen many of those).

Generally written by shell scripting beginners. Those are naive, literal translations of what you would do in imperative languages like C or Python, but that's not how you do things in shells: those examples are very inefficient, completely unreliable (potentially leading to security issues), and if you ever manage to fix most of the bugs, your code becomes illegible.

Conceptually

In C or most other languages, building blocks are just one level above computer instructions. You tell your processor what to do and then what to do next. You take your processor by the hand and micro-manage it: you open that file, you read that many bytes, you do this, you do that with it.

Shells are a higher-level language. One may say it's not even a language: they are, first and foremost, command-line interpreters. The job is done by the commands you run, and the shell is only meant to orchestrate them.

One of the great things Unix introduced was the pipe and those stdin/stdout/stderr streams that all commands handle by default.

In 50 years, we haven't found a better API than that to harness the power of commands and have them cooperate on a task. That's probably the main reason people are still using shells today.

You've got a cutting tool and a transliteration tool, and you can simply do:

cut -c4-5 < in | tr a b > out

The shell is just doing the plumbing (opening the files, setting up the pipes, invoking the commands), and once everything is ready, the data just flows without the shell doing anything more. The tools do their job concurrently and efficiently, at their own pace, with enough buffering so that neither blocks the other. It's beautiful and yet so simple.

Invoking a tool has a cost, though (and we'll expand on that under the performance point). Those tools may be written with thousands of instructions in C. A process has to be created, the tool loaded and initialised, then cleaned up, and the process destroyed and waited for.

Invoking cut is like opening the kitchen drawer, taking the knife, using it, washing it, drying it, and putting it back in the drawer. When you do:

while read line; do
  echo $line | cut -c3
done < file

It's as if, for each line of the file, you got the read tool from the kitchen drawer (a very clumsy one, because it was not designed for that), read a line, washed your read tool, and put it back in the drawer. Then you schedule a meeting of the echo and cut tools, get them from the drawer, invoke them, wash them, dry them, put them back in the drawer, and so on.

Some of those tools (read and echo) are built into most shells, but that hardly makes a difference here, since echo and cut still need to be run in separate processes.

It's like chopping an onion, but washing your knife and putting it back in the kitchen drawer between each slice.

Here the obvious way is to get your cut tool from the drawer, slice your whole onion and put it back in the drawer after the whole job is done.

IOW, in shells, especially for processing text, you invoke as few utilities as possible and have them cooperate on the task, rather than running thousands of tools in sequence, waiting for each one to start, run and clean up before running the next one.
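For instance, the while read loop above, which pipes each line to cut, is (roughly) equivalent to a single invocation of cut over the whole file, where one cut process slices every line in one go:

cut -c3 < file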

Further reading can be found in Bruce's fine answer. The low-level text-processing tools internal to shells (except maybe zsh's) are limited, cumbersome and generally not fit for general text processing.

Performance

As said earlier, running a command has a cost: a huge one if that command is not builtin, and still a significant one even when it is.
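To get a feel for that cost, here is a rough sketch you can try (assuming a POSIX sh and a /bin/true binary; the exact numbers vary wildly between systems):

# 10000 fork+execs of an external command vs 10000 runs of the ':' builtin
time sh -c 'i=0; while [ "$i" -lt 10000 ]; do /bin/true; i=$((i+1)); done'
time sh -c 'i=0; while [ "$i" -lt 10000 ]; do :; i=$((i+1)); done'

The first one spends almost all of its time creating and destroying processes; the second does no fork/exec at all.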

And shells have not been designed to run like that; they have no pretension to being performant programming languages. They are not: they're just command-line interpreters. So, little optimisation has been done on this front.

Also, shells run commands in separate processes. Those building blocks don't share a common memory or state. When you do an fgets() or fputs() in C, that's a function in stdio. stdio keeps internal buffers for input and output for all the stdio functions, to avoid making costly system calls too often.

The corresponding shell utilities (read, echo, printf), even when builtin, can't do that. read is meant to read just one line. If it read past the newline character, the next command you run would miss it. So read has to read the input one byte at a time (some implementations have an optimisation when the input is a regular file: they read chunks and seek back, but that only works for regular files, and bash, for instance, only reads 128-byte chunks, which is still a lot less than text utilities will read at once).
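On a Linux system with strace installed (an assumption; other systems have their own tracers), you can watch that happen when the input is a pipe, so not seekable:

# Apart from a few start-up reads, expect a long series of read(0, ..., 1) = 1
# calls, one per byte of the line:
printf 'one line\nand another\n' | strace -e trace=read sh -c 'IFS= read -r line'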

Same on the output side: echo can't just buffer its output, it has to write it straight away, because the next command you run will not share that buffer.

Obviously, running commands sequentially means you have to wait for them; it's a little scheduler dance that passes control from the shell to the tools and back. It also means (as opposed to using long-running instances of tools in a pipeline) that you cannot harness several processors at the same time when they are available.

Between that while read loop and the (supposedly) equivalent cut -c3 < file, in my quick test there's a CPU time ratio of around 40000 (one second versus half a day). But even if you use only shell builtins:

while read line; do
  echo ${line:2:1}
done < file

(here with bash), that's still around 1:600 (one second vs 10 minutes).
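If you want to reproduce that kind of measurement yourself, here is a sketch (assuming seq is available to generate some input; the ratio will depend on the shell, the utilities and the size of the input):

seq 10000 > file                 # sample input: 10000 short lines
time cut -c3 < file > /dev/null  # one process for the whole file
time sh -c '
  while IFS= read -r line; do
    printf "%s\n" "$line" | cut -c3
  done < file' > /dev/null       # one pipeline (printf + cut) per line: expect a huge slowdown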

Reliability/legibility

It's very hard to get that code right. The examples I gave are seen too often in the wild, but they have many bugs.

read is a handy tool that can do many different things: it can read input from the user and split it into words to store in different variables. But read line does not read a line of your input; or rather, it reads a line in a very special way. It actually reads words from the input, words separated by $IFS and where backslash can be used to escape the separators or the newline character.

With the default value of $IFS, on an input like:

   foo\/bar \
baz
biz

read line will store "foo/bar baz" in $line, not "   foo\/bar \" as you'd expect.
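You can check that for yourself with a short snippet (a sketch; the printf merely reproduces the three input lines above):

printf '   foo\\/bar \\\nbaz\nbiz\n' | {
  read line
  printf '[%s]\n' "$line"   # prints [foo/bar baz]: leading blanks, backslashes and the line break are gone
}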

To read a line, you actually need:

IFS= read -r line

That's not very intuitive, but that's the way it is; remember, shells were not meant to be used like that.

Same for echo. echo expands escape sequences in some implementations and treats arguments like -n as options in others, so you can't use it for arbitrary content such as a line from a random file. You need printf here instead.
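A quick illustration (how much echo mangles its argument depends on the shell and its settings, which is precisely the problem):

var='a\tb'
echo "$var"            # a\tb in some shells, a<TAB>b in others
var='-n'
echo "$var"            # many echo implementations print nothing at all here
printf '%s\n' "$var"   # always prints -n followed by one newline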

And of course, there's the typical failure to quote your variable, which everybody falls into. So it's more:

while IFS= read -r line; do
  printf '%s\n' "$line" | cut -c3
done < file

Now, a few more caveats:

  • except in zsh, that doesn't work if the input contains NUL characters, while at least GNU text utilities would not have that problem.
  • if there's data after the last newline, it will be skipped
  • inside the loop, stdin is redirected, so you need to pay attention that the commands in it don't read from it.
  • for the commands within the loop, we're not paying attention to whether they succeed or not. Usually, error conditions (disk full, read errors...) will be poorly handled, usually more poorly than with the correct single-utility equivalent.

If we want to address some of those issues above, that becomes:

while IFS= read -r line <&3; do
  {
    printf '%s\n' "$line" | cut -c3 || exit
  } 3<&-
done 3< file
if [ -n "$line" ]; then
    printf '%s' "$line" | cut -c3 || exit
fi

That's becoming less and less legible.

There are a number of other issues with passing data to commands via the arguments or retrieving their output in variables:

  • the limitation on the size of arguments (some text utility implementations have limits there as well, though the effects of reaching those are generally less problematic)
  • the NUL character (also a problem with text utilities).
  • arguments taken as options when they start with - (or sometimes +); see the sketch after this list
  • various quirks of various commands typically used in those loops like expr, test...
  • the (limited) text manipulation operators of various shells that handle multi-byte characters in inconsistent ways.
  • ...
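As an illustration of the option problem (a sketch; the pattern value and the sample input are made up):

pattern='-foo'
printf 'a -foo b\nnothing here\n' | grep "$pattern"     # fails: -foo is parsed as options
printf 'a -foo b\nnothing here\n' | grep -e "$pattern"  # -e marks the next argument as the pattern
printf 'a -foo b\nnothing here\n' | grep -- "$pattern"  # -- marks the end of options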

Security considerations

When you start working with shell variables and arguments to commands, you're entering a minefield.

If you forget to quote your variables, forget the end-of-options marker, or work in locales with multi-byte characters (the norm these days), you're certain to introduce bugs that sooner or later will become vulnerabilities.
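As a hypothetical illustration (the variable name and its value are made up), an unquoted expansion goes through split+glob, so a single untrusted string can turn into several, option-looking arguments:

injected='-rf foo *'            # imagine this value comes from untrusted input
printf 'arg: %s\n' $injected    # unquoted: splits into -rf, foo and whatever * expands to
printf 'arg: %s\n' "$injected"  # quoted: stays one single argument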

When you may want to use loops

TBD
