Iterating Over a File vs Reading into Memory – Performance

bash io performance

I'm comparing the following

tail -n 1000000 stdout.log | grep -c '"success": true'
tail -n 1000000 stdout.log | grep -c '"success": false'

with the following

log=$(tail -n 1000000 stdout.log)
echo "$log" | grep -c '"success": true'
echo "$log" | grep -c '"success": false'

and, surprisingly, the second takes almost three times longer than the first. It should be faster, shouldn't it?

Best Answer

On the one hand, the first method runs tail twice, so it has to do more work than the second method, which runs it only once. On the other hand, the second method has to copy the data into the shell and then back out, so it has to do more work than the first version, where tail is piped directly into grep. The first method has an extra advantage on a multi-processor machine: grep can work in parallel with tail, whereas the second method is strictly serialized: first tail, then grep.

So there's no obvious reason why one should be faster than the other.
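To put numbers on it, a rough harness along these lines (reusing the stdout.log from the question) times each approach as a whole; run it more than once so the file is in cache for both:

time sh -c '
  tail -n 1000000 stdout.log | grep -c "\"success\": true";
  tail -n 1000000 stdout.log | grep -c "\"success\": false"'

time bash -c '
  log=$(tail -n 1000000 stdout.log);
  echo "$log" | grep -c "\"success\": true";
  echo "$log" | grep -c "\"success\": false"'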

If you want to see what's going on, look at what system calls the shell makes. Try with different shells, too.

strace -t -f -o 1.strace sh -c '
  tail -n 1000000 stdout.log | grep "\"success\": true" | wc -l;
  tail -n 1000000 stdout.log | grep "\"success\": false" | wc -l'

strace -t -f -o 2-bash.strace bash -c '
  log=$(tail -n 1000000 stdout.log);
  echo "$log" | grep "\"success\": true" | wc -l;
  echo "$log" | grep "\"success\": true" | wc -l'

strace -t -f -o 2-zsh.strace zsh -c '
  log=$(tail -n 1000000 stdout.log);
  echo "$log" | grep "\"success\": true" | wc -l;
  echo "$log" | grep "\"success\": true" | wc -l'

With method 1, the main stages are:

  1. tail reads and seeks to find its starting point (see the check after this list).
  2. tail writes 4096-byte chunks which grep reads as fast as they're produced.
  3. Repeat the previous step for the second search string.
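To confirm the first step, the trace should show tail seeking near the end of the file rather than reading it from the beginning, with something like:

grep 'lseek(' 1.strace | head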

With method 2, the main stages are:

  1. tail reads and seeks to find its starting point.
  2. tail writes 4096-byte chunks, which bash reads 128 bytes at a time and zsh reads 4096 bytes at a time (a quick check follows the list).
  3. Bash or zsh writes 4096-byte chunks which grep reads as fast as they're produced.
  4. Repeat the previous step for the second search string.
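A quick check of the shell's read sizes in step 2 is to count how many read calls in each trace returned exactly 128 or 4096 bytes. It's only a rough count, since grep's and tail's own reads are mixed into the same file:

grep -Ec 'read\(.*\) = 128$' 2-bash.strace
grep -Ec 'read\(.*\) = 4096$' 2-zsh.strace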

Bash's 128-byte chunks when reading the output of the command substitution slow it down significantly; zsh comes out about as fast as method 1 for me. Your mileage may vary depending on the number and type of CPUs, the scheduler configuration, the versions of the tools involved, and the size of the data.
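If you want a feel for how much the small chunks cost by themselves, push the same amount of data through a pipe in 128-byte and then 64 KiB blocks (about 100 MiB of zeros here, nothing to do with the log file):

time dd if=/dev/zero bs=128 count=819200 2>/dev/null | wc -c
time dd if=/dev/zero bs=64k count=1600 2>/dev/null | wc -c

Both pipelines move the same data, but the first needs hundreds of times more system calls, and that difference shows up directly in the timings.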
