Efficient Bash Calculations – Best Practices

bash, performance, shell, shell-script

I'm trying to calculate the geometric mean of a file full of numbers (1 column).

The basic formula for the geometric mean is to average the natural log (or common log) of all the values and then raise e (or 10) to that result.
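
In other words, for n positive values: GM = e^((ln x1 + ln x2 + … + ln xn) / n).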

My current bash-only script looks like this:

# Geometric Mean
count=0
total=0

for i in $(awk '{ print $1 }' input.txt); do
    if (( $(echo "$i > 0" | bc -l) )); then
        total=$(echo "$total + l($i)" | bc -l)
        ((count++))
    fi
done

Geometric_Mean=$(printf "%.2f" "$(echo "scale=3; e($total / $count)" | bc -l)")
echo "$Geometric_Mean"

Essentially:

  1. Check every entry in the input file to make sure it is larger than 0, calling bc each time
  2. If the entry is > 0, I take the natural log (l) of that value and add it to the running total, calling bc again each time
  3. If the entry is <= 0, I do nothing
  4. Calculate the Geometric Mean

This works perfectly fine for a small data set. Unfortunately, I am trying to use it on a large data set (input.txt has 250,000 values), and since the loop forks bc up to twice per value (once for the comparison, once for the addition), that is on the order of half a million processes. While I believe it would eventually finish, it is extremely slow; I have never been patient enough to let it complete (45+ minutes).

I need a way of processing this file more efficiently.

There are alternative approaches, such as using Python:

# Import the library you need for math
import numpy as np

# Open the file
# Load the lines into a list of float objects
# Close the file
infile = open('time_trial.txt', 'r')
x = [float(line) for line in infile.readlines()]
infile.close()

# Define a function called geo_mean
# Use numpy to create an array "a" holding the ln of all the values
# Use numpy to raise e to (the sum of a divided by the count of a)
# Note: this will break if you have values <= 0
def geo_mean(x):
    a = np.log(x)
    return np.exp(a.sum()/len(a))

print("The Geometric Mean is: ", geo_mean(x))

I would like to avoid using Python, Ruby, Perl, etc.

Any suggestions on how to write my bash script more efficiently?

Best Answer

Please don't do this in the shell. No amount of tweaking will ever make it remotely efficient: shell loops are slow, and using the shell to parse text is bad practice. Your whole script can be replaced by this simple awk one-liner, which will be orders of magnitude faster:

awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file

For example, if I run that on a file containing the numbers from 1 to 100, I get:

$ seq 100 > file
$ awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' file
37.99
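
As an aside, awk also has a built-in exp(), so the same one-liner can be written without the E variable:

awk '$1>0{tot+=log($1); c++} END{printf "%.2f\n", exp(tot/c)}' file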

In terms of speed, I tested your shell solution, your Python solution, and the awk I gave above on a file containing the numbers from 1 to 10000, created with seq as before:
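
$ seq 10000 > input.txt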

### Shell
$ time foo.sh
3677.54

real    1m0.720s
user    0m48.720s
sys     0m24.733s

### Python
$ time foo.py
The Geometric Mean is:  3680.827182220091

real    0m0.149s
user    0m0.121s
sys     0m0.027s

### Awk
$ time awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++} END{m=tot/c; printf "%.2f\n", E^m}' input.txt
3680.83

real    0m0.011s
user    0m0.010s
sys     0m0.001s

As you can see, the awk is even faster than the Python and far simpler to write. You can also make it into a "shell" script if you like, either like this:

#!/bin/awk -f

BEGIN{
    E = exp(1);
}
$1>0{
    tot+=log($1);
    c++;
}

END{
    m=tot/c;
    printf "%.2f\n", E^m;
}
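
If you save that as, say, geomean.awk and make it executable, you can run it directly (note that awk often lives at /usr/bin/awk rather than /bin/awk, so adjust the shebang to match your system):

$ chmod +x geomean.awk
$ ./geomean.awk input.txt
3680.83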

or by saving the command in a shell script:

#!/bin/sh
awk 'BEGIN{E = exp(1);} $1>0{tot+=log($1); c++;} END{m=tot/c; printf "%.2f\n", E^m}' "$1"