Linux – SQL-like group by and sum for text files on the command line

awk, bash, command line, linux, sed

I have huge text files with two fields: the first is a string, the second an integer. The files are sorted by the first field. In the output I'd like one line per unique string, together with the sum of the numbers for that string. Some strings appear only once, while others appear multiple times.
E.g., given the sample data below, for the string glehnia I'd like to get 10+22=32 in the result.

Any suggestions on how to do this, either with the GnuWin32 command-line tools or in a Linux shell?

Thanks!

glehnia 10
glehnia 22
glehniae 343
glehnii 923
glei 1171
glei 2283
glei 3466
gleib 914
gleiber 652
gleiberg 495
gleiberg 709

Best Answer

In AWK, you could do something like this:

awk '($1 == last) || (last == "") {sum += $2}                    # same key (or first line): accumulate
     ($1 != last) && (last != "") {print last " " sum; sum = $2} # new key: flush the previous group
                                  {last = $1}                    # remember the current key
     END                          {print last " " sum}' huge_text_file.txt  # flush the final group
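If you don't want to rely on the input being sorted, a sketch using an awk associative array would also work; the key names and the file name `words.txt` below are only for illustration, and the output order is not guaranteed (pipe through `sort` if order matters):

```shell
# Build a small sample file (hypothetical name "words.txt").
printf 'glehnia 10\nglehnia 22\nglehniae 343\nglei 1171\nglei 2283\n' > words.txt

# Accumulate the second field per unique first field, then print all sums.
awk '{ sum[$1] += $2 }
     END { for (key in sum) print key, sum[key] }' words.txt | sort
```

This holds one array entry per unique string in memory, so for very large files with many distinct keys the sorted-input, single-pass approach above uses less memory.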