Efficient Bash Calculations – Best Practices

bash, performance, shell, shell-script

I'm trying to calculate the geometric mean of a file full of numbers (1 column).

The basic formula for the geometric mean is to average the natural log (or common log) of all the values and then raise e (or 10) to that result.
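
In other words, for n positive values: GM = e^((ln x1 + ln x2 + … + ln xn) / n).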

My current bash-only script looks like this:

# Geometric Mean
count=0
total=0

for i in $(awk '{ print $1 }' input.txt); do
    if (( $(echo "$i > 0" | bc -l) )); then
        total=$(echo "$total + l($i)" | bc -l)
        ((count++))
    fi
done

Geometric_Mean=$(printf "%.2f" "$(echo "scale=3; e($total / $count)" | bc -l)")
echo "$Geometric_Mean"

Essentially:

  1. Check every entry in the input file to make sure it is larger than 0, calling bc each time
  2. If the entry is > 0, I take the natural log (l) of that value and add it to the running total, calling bc again each time
  3. If the entry is <= 0, I do nothing
  4. Calculate the Geometric Mean

This works perfectly fine for a small data set. Unfortunately, I am trying to use it on a large data set (input.txt has 250,000 values), and since the loop forks bc up to twice per value (once for the comparison, once for the addition), that is on the order of half a million processes. While I believe it would eventually finish, it is extremely slow; I have never been patient enough to let it complete (45+ minutes).

I need a way of processing this file more efficiently.

There are alternative approaches, such as using Python:

# Import the library you need for math
import numpy as np

# Open the file
# Load the lines into a list of float objects
# Close the file
infile = open('time_trial.txt', 'r')
x = [float(line) for line in infile.readlines()]
infile.close()

# Define a function called geo_mean
# Use numpy to create an array "a" holding the ln of all the values
# Use numpy to raise e to (the sum of a divided by the count of a)
# Note: this will break if you have values <= 0
def geo_mean(x):
    a = np.log(x)
    return np.exp(a.sum()/len(a))

print("The Geometric Mean is: ", geo_mean(x))

I would like to avoid using Python, Ruby, Perl, etc.

Any suggestions on how to write my bash script more efficiently?

Best Answer

Please don't do this in the shell. No amount of tweaking will ever make it remotely efficient: shell loops are slow, and using the shell to parse text is bad practice. Your whole script can be replaced by this simple awk one-liner, which will be orders of magnitude faster:

awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file

For example, if I run that on a file containing the numbers from 1 to 100, I get:

$ seq 100 > file
$ awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file
37.99
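
As an aside, awk also has a built-in exp(), so the same one-liner can be written without the E variable:

awk '$1>0{tot+=log($1); c++} END{printf "%.2f\n", exp(tot/c)}' file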

In terms of speed, I tested your shell solution, your Python solution, and the awk I gave above on a file containing the numbers from 1 to 10000, created with seq as before:
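
$ seq 10000 > input.txt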

### Shell
$ time foo.sh
3677.54

real    1m0.720s
user    0m48.720s
sys     0m24.733s

### Python
$ time foo.py
The Geometric Mean is:  3680.827182220091

real    0m0.149s
user    0m0.121s
sys     0m0.027s

### Awk
$ time awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' input.txt
3680.83

real    0m0.011s
user    0m0.010s
sys     0m0.001s

As you can see, the awk is even faster than the Python and far simpler to write. You can also make it into a "shell" script if you like, either like this:

#!/bin/awk -f

BEGIN{
    E = exp(1);
}
$1>0{
    tot+=log($1);
    c++;
}

END{
    m=tot/c;
    printf "%.2f\n", E^m;
}
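
If you save that as, say, geomean.awk and make it executable, you can run it directly (note that awk often lives at /usr/bin/awk rather than /bin/awk, so adjust the shebang to match your system):

$ chmod +x geomean.awk
$ ./geomean.awk input.txt
3680.83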

or by saving the command in a shell script:

#!/bin/sh
awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++;} END{m=tot/c; printf "%.2f\n", E^m}' "$1"