Echo vs Cat – Understanding Execution Time Differences

Tags: cat, command-substitution, echo, quoting, shell

Answering this question caused me to ask another question:
I thought the following scripts do the same thing and that the second one should be much faster, because the first one uses cat, which needs to open the file over and over, while the second one opens the file only once and then just echoes a variable:

(See update section for correct code.)

First:

#!/bin/sh
for j in seq 10; do
  cat input
done >> output

Second:

#!/bin/sh
i=`cat input`
for j in seq 10; do
  echo $i
done >> output

where input is a file of about 50 megabytes.

But when I tried the second one, it was very, very slow: echoing the variable i turned out to be a massive process. I also ran into some problems with the second script; for example, the size of the output file was lower than expected.

I also checked the man pages of echo and cat to compare them:

echo – display a line of text

cat – concatenate files and print on the standard output

But I didn't get the difference.

So:

  • Why is cat so fast and echo so slow in the second script?
  • Or is the problem with the variable i? (The man page for echo
    says it displays "a line of text", so I guess it is optimized
    only for short variables, not for very long variables like i.
    However, that is only a guess.)
  • And why did I get problems when I used echo?

UPDATE

I had incorrectly used seq 10 instead of `seq 10`. This is the edited code:

First:

#!/bin/sh
for j in `seq 10`; do
  cat input
done >> output

Second:

#!/bin/sh
i=`cat input`
for j in `seq 10`; do
  echo $i
done >> output

(Special thanks to roaima.)

However, that is not the point of the problem. Even if the loop runs only once, I get the same problem: cat works much faster than echo.

Best Answer

There are several things to consider here.

i=`cat input`

can be expensive, and there is a lot of variation between shells.

That's a feature called command substitution. The idea is to store the whole output of the command minus the trailing newline characters into the i variable in memory.
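
A quick way to see that trailing-newline stripping in action (a minimal sketch; f is just a scratch file name used for illustration):

printf 'line\n\n\n' > f
i=`cat f`
printf '%s' "$i" | od -c    # shows only "l i n e"; all three \n are gone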

To do that, shells fork the command in a subshell and read its output through a pipe or socketpair. You see a lot of variation here. On a 50MiB file here, I can see for instance bash being 6 times as slow as ksh93 but slightly faster than zsh and twice as fast as yash.
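
If you want to reproduce such a comparison on your own machine, a rough sketch (assuming those shells are installed; absolute numbers vary a lot between systems):

for sh in bash ksh93 zsh yash; do
  echo "== $sh =="
  time "$sh" -c 'i=$(cat input); :'
done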

The main reason for bash being slow is that it reads from the pipe 128 bytes at a time (while other shells read 4KiB or 8KiB at a time) and is penalised by the system call overhead.
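
On Linux you can observe those read sizes with strace (assuming it is installed; the exact output and the pipe's file descriptor number vary by version):

strace -f -e trace=read bash -c 'i=$(cat input); :' 2>&1 | tail
# bash issues a long series of read(n, ..., 128) calls on the pipe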

zsh needs to do some post-processing to escape NUL bytes (other shells break on NUL bytes), and yash does even more heavy-duty processing by parsing multi-byte characters.
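
The NUL-byte differences are easy to check (behaviour varies by shell and version):

printf 'a\0b' | od -c                           # the raw bytes: a \0 b
i=$(printf 'a\0b'); printf '%s' "$i" | od -c    # bash: a b, the NUL is dropped
# zsh keeps the NUL in $i; recent bash versions also print a warning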

All shells need to strip the trailing newline characters which they may be doing more or less efficiently.

Some may want to handle NUL bytes more gracefully than others and check for their presence.

Then, once you have that big variable in memory, any manipulation of it generally involves allocating more memory and copying data across.

Here, you're passing (or rather, were intending to pass) the content of the variable to echo.

Luckily, echo is built into your shell; otherwise the execution would likely have failed with an arg list too long error. Even then, building the argument list array will possibly involve copying the content of the variable.
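
You can demonstrate that limit by forcing the external binary instead of the builtin (a sketch; the path and the exact limit are system-dependent, on the order of 2MiB on Linux):

i=$(cat input)    # ~50MiB now held in the variable
/bin/echo "$i"    # likely fails with "Argument list too long" (E2BIG)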

The other main problem in your command substitution approach is that you're invoking the split+glob operator (by forgetting to quote the variable).

For that, shells need to treat the string as a string of characters (though some shells don't and are buggy in that regard), so in UTF-8 locales that means parsing UTF-8 sequences (if that hasn't been done already, as in yash) and looking for $IFS characters in the string. If $IFS contains space, tab or newline (which is the case by default), the algorithm is even more complex and expensive. Then, the words resulting from that splitting need to be allocated and copied.
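
The splitting half is easy to demonstrate (a minimal sketch):

i='foo   bar baz'
printf '<%s>\n' $i      # unquoted: split on $IFS into <foo> <bar> <baz>
printf '<%s>\n' "$i"    # quoted: a single argument, <foo   bar baz>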

The glob part will be even more expensive. If any of those words contain glob characters (*, ?, [), then the shell will have to read the content of some directories and do some expensive pattern matching (bash's implementation for instance is notoriously very bad at that).

If the input contains something like /*/*/*/../../../*/*/*/../../../*/*/*, that will be extremely expensive as that means listing thousands of directories and that can expand to several hundred MiB.
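
A safer way to see the glob step than that pathological pattern (a minimal sketch; run it in a directory containing a few files):

i='*'
printf '<%s>\n' $i      # unquoted: * expands to the file names here (if any match)
printf '<%s>\n' "$i"    # quoted: no split, no glob, just <*>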

Then echo will typically do some extra processing. Some implementations expand \x sequences in the argument it receives, which means parsing the content and probably another allocation and copy of the data.
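
Whether those escapes are expanded depends on the implementation (for example, dash's builtin echo expands them; bash's by default does not):

echo 'a\tb'             # dash: a<TAB>b; bash: a\tb (unless xpg_echo is set)
printf '%s\n' 'a\tb'    # printf %s never expands them: always a\tb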

cat, on the other hand, is not built in to most shells, so using it means forking a process and executing it (and so loading the code and the libraries), but after the first invocation, that code and the content of the input file will be cached in memory. There is also no intermediary: cat reads large amounts at a time and writes them straight out without processing, and it doesn't need to allocate a huge amount of memory, just the one buffer that it reuses.
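
strace again makes the buffering visible (assuming GNU cat on Linux; the buffer size follows the file system's preferred I/O size, typically 128KiB):

strace -c -e trace=read cat input > /dev/null
# the summary on stderr shows a few hundred 128KiB reads for a 50MiB file,
# versus roughly 400,000 reads of 128 bytes with bash's command substitution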

It also means that it's a lot more reliable, as it doesn't choke on NUL bytes and doesn't trim trailing newline characters (and doesn't do split+glob, though you can avoid that by quoting the variable, and doesn't expand escape sequences, though you can avoid that by using printf instead of echo).

If you want to optimise it further, instead of invoking cat several times, just pass input to cat several times:

yes input | head -n 100 | xargs cat

That will run 3 commands instead of 100.
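
To see what the pipeline builds, you can inspect each stage (a minimal sketch with 3 repetitions):

yes input | head -n 3                 # prints "input" three times
yes input | head -n 3 | xargs echo    # one command line: input input input
# with 100 names, xargs hands them all to a single cat invocation,
# as long as they fit within the system's argument-length limit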

To make the variable version more reliable, you would need to use zsh (other shells can't cope with NUL bytes) and do:

zmodload zsh/mapfile
var=$mapfile[input]
repeat 10 print -rn -- "$var"

If you know the input doesn't contain NUL bytes, then you can reliably do it POSIXly (though it may not work where printf is not builtin) with:

i=$(cat input && echo .) || exit # add an extra .\n to avoid trimming newlines
i=${i%.} # remove that trailing dot (the \n was removed by cmdsubst)
n=10
while [ "$n" -gt 0 ]; do
  printf %s "$i"
  n=$((n - 1))
done

But that is never going to be more efficient than using cat in the loop (unless the input is very small).
