I'm trying to calculate the geometric mean of a file full of numbers (1 column).
The basic formula for the geometric mean is: take the average of the natural logs (or common logs) of all the values, then raise e (or 10) to that average.
My current bash only script looks like this:
# Geometric Mean
count=0
total=0
for i in $(awk '{ print $1 }' input.txt)
do
    if (( $(echo "$i > 0" | bc -l) )); then
        total=$(echo "$total + l($i)" | bc -l)
        ((count++))
    fi
done
Geometric_Mean=$(printf '%.2f' "$(echo "scale=3; e($total / $count)" | bc -l)")
echo "$Geometric_Mean"
Essentially:
- Check every entry in the input file to make sure it is larger than 0, calling bc each time
- If the entry is > 0, I take the natural log (l) of that value and add it to the running total, again calling bc each time
- If the entry is <= 0, I do nothing
- Calculate the geometric mean
This works perfectly fine for a small data set. Unfortunately, I am trying to use this on a large data set (input.txt has 250,000 values). While I believe this will eventually work, it is extremely slow. I've never been patient enough to let it finish (45+ minutes).
I need a way of processing this file more efficiently.
There are alternative ways such as using Python
# Import the library you need for math
import numpy as np

# Open the file, load the lines into a list of floats,
# and close the file again automatically
with open('time_trial.txt') as infile:
    x = [float(line) for line in infile]
# Define a function called geo_mean
# Use numpy to create an array "a" with the ln of all the values
# Use numpy to exp() the sum of a divided by the count of a
# Note ... this will break if you have values <= 0
def geo_mean(x):
    a = np.log(x)
    return np.exp(a.sum() / len(a))
print("The Geometric Mean is: ", geo_mean(x))
I would like to avoid using Python, Ruby, Perl, etc.
Any suggestions on how to write my bash script more efficiently?
Best Answer
Please don't do this in the shell. No amount of tweaking will ever make it remotely efficient: shell loops are slow, and using the shell to parse text is bad practice. Your whole script can be replaced by a simple awk one-liner, which will be orders of magnitude faster.
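A sketch of such an awk one-liner, assuming the numbers are in the first column of input.txt (awk's built-in log() and exp() are natural log and e^x, so no bc calls are needed):

```shell
# Generate sample data: the integers 1 to 100 stand in for input.txt
seq 1 100 > input.txt

# One pass over the file: sum the natural logs of the positive values,
# count them, then print e raised to the mean log -- the geometric mean.
# Values <= 0 are skipped by the $1 > 0 pattern, as in the shell script.
awk '$1 > 0 { total += log($1); count++ }
     END { printf "%.2f\n", exp(total / count) }' input.txt
```

On the integers 1 to 100 this prints 37.99. Because awk does all the arithmetic internally, it makes a single pass over the file instead of forking bc for every line.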
In terms of speed, I tested your shell solution, your Python solution and the awk I gave above on a file containing the numbers from 1 to 10000.
As you can see, the awk is even faster than the Python and far simpler to write. You can also make it into a "shell" script if you like, either by wrapping the command in a shell function or by saving it in a standalone shell script.
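A sketch of the standalone-script variant (the file name geomean.sh is illustrative, not from the original answer):

```shell
#!/bin/sh
# geomean.sh -- print the geometric mean of the first column of the
# file named as the first argument, skipping values <= 0
awk '$1 > 0 { total += log($1); count++ }
     END { printf "%.2f\n", exp(total / count) }' "$1"
```

Save it, make it executable with chmod +x geomean.sh, and run it as ./geomean.sh input.txt.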