Shell – Fastest way to sum Nth column in text file

awk · sed · shell-script · text-processing

I have a CSV file (in which the field separator is indeed comma) with 8 columns and a few million rows. Here's a sample:

1000024447,38111220,201705,181359,0,12,1,3090
1064458324,38009543,201507,9,0,1,1,1298
1064458324,38009543,201508,9,0,2,1,90017

What's the fastest way to print the sum of all numbers in a given column, along with the number of lines read? Can you also explain what makes your approach fast?

Best Answer

GNU datamash

$ datamash -t, count 3 sum 3 < file
3,604720
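Here `count 3` counts the values in field 3 and `sum 3` totals them, with `-t,` setting the comma delimiter. On a system without GNU datamash, the same count-and-sum can be sketched with POSIX awk; the `col` variable parameterizing the column is my addition, not part of the answer:

```shell
# sum an arbitrary column (here the 3rd) and count lines with POSIX awk
col=3
awk -F',' -v c="$col" '{ sum += $c } END { print sum "," NR }' file
```

On the three-row sample this prints `604720,3` (sum first, then the line count, so the fields are swapped relative to datamash's `count`-then-`sum` order).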

Some testing
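The `longfile` used in the timings isn't shown. A plausible reconstruction — an assumption on my part, not the answerer's actual file — is the three sample rows cycled a million times, which yields exactly the totals reported below (3,000,000 lines, column-3 sum 604720000000):

```shell
# hypothetical reconstruction of longfile: cycle the three sample rows
# until 3,000,000 lines; column-3 sum works out to 604720000000
yes '1000024447,38111220,201705,181359,0,12,1,3090
1064458324,38009543,201507,9,0,1,1,1298
1064458324,38009543,201508,9,0,2,1,90017' | head -n 3000000 > longfile
```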

$ time gawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
604720000000 3000000

real    0m2.851s
user    0m2.784s
sys     0m0.068s

$ time mawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
6.0472e+11 3000000

real    0m0.967s
user    0m0.920s
sys     0m0.048s

$ time perl -F, -nle '$sum += $F[2] }{ print "$.,$sum"' longfile
3000000,604720000000

real    0m3.394s
user    0m3.364s
sys     0m0.036s

$ time { cut -d, -f3 <longfile |paste -s -d+ - |bc ; }
604720000000

real    0m1.679s
user    0m1.416s
sys     0m0.248s

$ time datamash -t, count 3 sum 3 < longfile
3000000,604720000000

real    0m0.815s
user    0m0.716s
sys     0m0.036s

So mawk (the fastest awk tested) and datamash (the fastest overall) appear to be the pick of the bunch.
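A side note on the mawk run above: mawk printed `6.0472e+11` because it converts large numbers for output with a default `%.6g` format. Using `printf` in the `END` block — a small tweak of the same one-liner — keeps full precision:

```shell
# same one-liner, but printf "%.0f" avoids mawk's default %.6g
# scientific-notation output for large sums
mawk -F',' '{ sum += $3 } END { printf "%.0f %d\n", sum, NR }' longfile
```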
