Sum Columns with awk – How to Sum Pair of Columns Based on Matching Fields

awksorttext processinguniq

I have a large file in the following format:

2 1019 0 12 
2 1019 3 0 
2 1021 0 2 
2 1021 2 0 
2 1022 4 5
2 1030 0 1 
2 1030 5 0 
2 1031 4 4

If the values in column 2 match, I want to sum the values in column 3 and 4 of both lines, else just the sum of the values in the unique line.

So the output I am hoping for would look like this:

2 1019 15 
2 1021 4 
2 1022 9 
2 1030 6 
2 1031 8

I am able to sort files according to column 2 with awk or sort and sum the last columns with awk, but only for individual lines not for two lines where column 2 matches.

Best Answer

I would do this in Perl:

$ perl -lane '$k{"$F[0] $F[1]"}+=$F[2]+$F[3]; 
              END{print "$_ $k{$_}" for keys(%k) }' file 
2 1019 15
2 1021 4
2 1030 6
2 1031 8
2 1022 9

Or awk:

awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file 

If you want the output sorted according to the second column you could just pipe to sort:

awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file | sort -k2

Note that both solutions include the 1st column as well. The idea is to use the first and second columns as keys to a hash (in perl) or an associative array (in awk). The key in each solution is column1 column2 so if two lines have the same column two but a different column one, they will be grouped separately:

$ cat file
2 1019 2 3
2 1019 4 1
3 1019 2 2

$ awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file
3 1019 4
2 1019 10
Related Question