Ubuntu – Print unique words, total number of occurrences and sum using `awk`

awk, command line, text processing

How can I print unique words, number of their occurrences and the sum of their values in the relevant column using a single array in awk?

I'm using awk like:

awk -F, '{sum[$1]+=$2} END{for (x in sum) print x, sum[x]}' inFile

Can I modify the command above to print the total number of occurrences of unique words as well? Something like the below result for the following sample input:

Result (the order of the printed results doesn't matter):

A 2 25 
B 1 12 
C 3 18

Input:

A,15
C,13
C,4
A,10
B,12
C,1

I can add another array to count them separately but I think there should be another way to print it just using the same array.

Is there any index of the array sum which stores the total words seen?

Best Answer

This should do:

awk -F, '{x[$1]["count"]++;x[$1]["sum"]+=$2}END{for(y in x){print y,x[y]["count"],x[y]["sum"]}}' in

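Basically you replace the array with a multidimensional array in order to store both the count of the occurrences of each unique first field and the sum of their respective second fields. Note that true multidimensional arrays (arrays of arrays) are a GNU awk feature, available since gawk 4.0; other awk implementations reject the x[$1]["count"] syntax.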

% cat in
A,15
C,13
C,4
A,10
B,12
C,1
% awk -F, '{x[$1]["count"]++;x[$1]["sum"]+=$2}END{for(y in x){print y,x[y]["count"],x[y]["sum"]}}' in
A 2 25
B 1 12
C 3 18
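
If gawk is not available, the same idea can be sketched with the classic SUBSEP-joined keys that every POSIX awk supports, still using a single array. This is only a sketch: the function name count_and_sum is made up for illustration, and the trailing sort is there only because the order of a for-in loop is unspecified.

```shell
# Hypothetical helper (the name is an assumption, not part of the answer).
# Reads comma-separated "word,value" lines on standard input.
count_and_sum() {
    awk -F, '
        # x[word, "count"] and x[word, "sum"] live in the same array;
        # awk joins the two subscripts with SUBSEP into one string key.
        { x[$1, "count"]++; x[$1, "sum"] += $2 }
        END {
            for (k in x) {
                split(k, part, SUBSEP)
                # Visit each word once, through its "count" key.
                if (part[2] == "count")
                    print part[1], x[k], x[part[1], "sum"]
            }
        }'
}

printf 'A,15\nC,13\nC,4\nA,10\nB,12\nC,1\n' | count_and_sum | sort
```

With the sample input from the question, this prints the same three lines as the gawk version above.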

#1

Golfed one-liner:

perl -lane 'my $s;my @m=$F[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@F,$s);print(join(",",@F))' infile

Expanded full script:

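#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

# Three-argument open, so the file name is never interpreted as a mode.
open(my $in, "<", $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");

while(<$in>) {
    my $string = "";
    # Split on whitespace: identifier first, sequence second.
    my @fields = split(" ");
    # Every "C" together with the character after it, if any.
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        # A "CC" match contributes "CC#"; any other match contributes "C#".
        $string .= $_ eq "CC" ? "CC#" : "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}

close($in);

exit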

Explanation:

The input file is processed line by line;
Each line is split into two strings, the part before the space and the part after the space;
In the second string, every substring made of a "C" optionally followed by one more character is matched (the following character is optional so that a "C" at the end of the string is caught too); a "CC" match appends "CC#" to a temporary string ("C" for the first C, which is followed by a C, and "C#" for the second), while any other match appends "C#";
The first string, the second string and the temporary string are printed, comma-separated, followed by a newline;
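
For comparison with the question at the top of the page, the same steps can be sketched in awk. This is an illustrative translation, not part of the original answer: the function name c_pattern and the one-line input are made up, and the loop uses match() with RSTART and RLENGTH because POSIX awk has no direct equivalent of Perl's /g list match.

```shell
# Hypothetical helper (name is an assumption): reads "identifier sequence" lines.
c_pattern() {
    awk '{
        rest = $2
        out = ""
        # Repeatedly find the next "C" plus the character after it, if any.
        while (match(rest, /C.?/)) {
            m = substr(rest, RSTART, RLENGTH)
            # A "CC" match contributes "CC#"; any other match contributes "C#".
            out = out (m == "CC" ? "CC#" : "C#")
            # Continue scanning after the matched text.
            rest = substr(rest, RSTART + RLENGTH)
        }
        print $1 "," $2 "," out
    }'
}

# Made-up input line, used only to exercise the three cases
# (C + other character, CC, and C at the end of the string).
printf 'id1 AXCSCCTC\n' | c_pattern
```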

Sample output:

% cat infile
c32_g1_i1_3 GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
c32_g1_i1_6 AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl -ne 'my $s;my @f=split(" ");my @m=$f[1]=~/C.?/g;foreach(@m){$_ eq"CC"?$s.="C":$s.="C#"}push(@f,$s);print(join(",",@f)."\n")' infile
c32_g1_i1_3,GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#
c32_g1_i1_6,AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#

#2

Expanded full version:

#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 1 || die("Usage: <command> <input_file>\n");

open(my $in, "<", $ARGV[0]) || die("Could not open input file \"$ARGV[0]\": $!\n");
open(my $tmp, "+>", "tmpfile") || die("Could not create temporary file \"tmpfile\": $!\n");

select($tmp);

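while(<$in>) {
    # Header lines get a trailing space (inserted before the newline)
    # to separate the identifier from the sequence.
    if(/^>/) {
        s/$/ /
    }
    # Peek at the next line to decide whether the current one continues.
    if(my $next = <$in>) {
        # The record continues unless the next line is a header,
        # so drop the newline to join the lines.
        if($next !~ /^>/) {
            chomp
        }
        print;
        # Rewind so the peeked line is read again by the loop.
        seek($in, -length($next), 1)
    }
    else {
        print
    }
}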

close($in);

seek($tmp, 0, 0);

select(STDOUT);

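while(<$tmp>) {
    my $string = "";
    # Split on the last space of the line (and on the trailing newline).
    my @fields = split(/ (?!.* )|\n/);
    my @matches = $fields[1] =~ /C.?/g;
    foreach(@matches) {
        # A "CC" match contributes "CC#"; any other match contributes "C#".
        $string .= $_ eq "CC" ? "CC#" : "C#"
    }
    push(@fields, $string);
    print(join(",", @fields) . "\n")
}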

close($tmp);

unlink("tmpfile");

exit

Explanation:

The input file is processed line by line;
If the current line starts with a > character, a space is appended to the line; if a following line exists and doesn't start with a > character, the newline character is stripped from the current line; the current line is then printed to a temporary file, so that each header ends up on one line with its whole sequence;
The temporary file is processed line by line;
Each line is split into two strings, the part before the last space and the part after the last space;
In the second string, every substring made of a "C" optionally followed by one more character is matched (the following character is optional so that a "C" at the end of the string is caught too); a "CC" match appends "CC#" to a temporary string ("C" for the first C, which is followed by a C, and "C#" for the second), while any other match appends "C#";
The first string, the second string and the temporary string are printed, comma-separated, followed by a newline;
The temporary file is removed;
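
The joining step can also be done without a temporary file by accumulating each record in memory. The awk sketch below shows only that step, under the assumption that every record starts with a line beginning with ">"; the function name join_records and the tiny input are invented for illustration, and this is a different technique from the seek-based script above.

```shell
# Hypothetical helper (name is an assumption): joins each ">" header
# with all the sequence lines that follow it, as "header,sequence".
join_records() {
    awk '
        /^>/ {
            # A new header starts: flush the previous record, if any.
            if (id != "") print id "," seq
            id = $0
            seq = ""
            next
        }
        # Any other line is a continuation of the current sequence.
        { seq = seq $0 }
        # Flush the last record at end of input.
        END { if (id != "") print id "," seq }'
}

# Made-up two-record input.
printf '>a\nAC\nCT\n>b\nGG\n' | join_records
```

The third column could then be produced from the joined sequence in the same pass, using the same matching logic as the Perl script.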

Sample output:

% cat infile 
>c32_g1_i1_
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS
>c32_g1_i1_
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX
% perl script.pl infile 
>c32_g1_i1_,3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS3GQKAKLKVPVFFLHRRGSICSSFYLMFSFEIKKK*TSKN*CFVCVRVRNRERAGVKCAHVYCPMFNGTQTH*IIISSLNS,C#C#C#C#C#C#C#C#C#C#
>c32_g1_i1_,6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX6AV*TADDDLVRLCSIEHGTIHMCTLYTCCTLTVTHTYTHKTLIFACLFFFNFKGEHQIERAANRTSSM*KKHRNF*LGLLAX,C#C#CC#C#C#C#CC#C#
