Text Processing with awk – Calculate Average of Values Based on Another Field

Is there a way to get the average of the values in one field, grouped by the values of another field? For example, for the following input:

a x 3
b y 4
a y 2
b x 5
b x 20

I want this output

a 2.5
b 9.67

I found this awk script, which computes the average of all the values in a column:

awk '{ total += $3; count++ } END { print total/count }' file.txt

but how can I add a for loop to it so that I get a separate average for every distinct value in column 1?

The file is tab-separated.

Thank you

Best Answer

Miller is also handy for tasks like this, for example:

$ mlr --nidx stats1 -a mean -f 3 -g 1 file.txt
a 2.500000
b 9.666667

or (with a more recent version that has the format-values verb)

$ mlr --nidx stats1 -a mean -f 3 -g 1 then format-values -f '%.2f' file.txt
a 2.50
b 9.67
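
Since the question asks about awk specifically, the same grouping can also be sketched in plain awk with two associative arrays keyed on the first field. This is a minimal sketch and not part of the original answer: the function name avg_by_key is made up for illustration, the input is assumed to be tab-separated as stated in the question, and the output is piped through sort because awk does not guarantee the order of for (k in ...).

```shell
#!/bin/sh
# avg_by_key: hypothetical helper; prints the mean of column 3 for each value of column 1.
avg_by_key() {
    awk -F '\t' '
        { sum[$1] += $3; n[$1]++ }           # accumulate total and count per key
        END {
            for (k in sum)                   # iteration order is unspecified
                printf "%s %.2f\n", k, sum[k] / n[k]
        }' "$@" | sort
}

# Sample data from the question, fed on standard input.
printf 'a\tx\t3\nb\ty\t4\na\ty\t2\nb\tx\t5\nb\tx\t20\n' | avg_by_key
```

With the sample data this prints a 2.50 and b 9.67. The %.2f format always shows two decimals, so a comes out as 2.50 rather than 2.5.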

Related Solutions

Merging 2 files based on field match

$ awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' file2 file1
aa 45 32
bb 31 15
cc 50 78

Explanation:

awk implicitly loops through each file, one line at a time. Since we gave it file2 as the first argument, it is read first. file1 is read second.

FNR==NR{a[$1]=$2;next}

NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file2. For every line in file2, we assign a[$1]=$2.

Here, a is an associative array and a[$1]=$2 means saving file2's second column, denoted $2, as a value in array a using file2's first column, $1, as the key.

next tells awk to skip the rest of the commands and start over with the next line.

($1 in a) {print $1,a[$1],$2}

If we get here, that means that we are reading the second file: file1. If the first field of the current line was also seen in file2, as determined by the keys of array a, then we print the key followed by the value of field 2 from file2 and the value of field 2 from file1.
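
To try the one-liner without the original files, here is a self-contained sketch. The contents of file1 and file2 are assumptions chosen to reproduce the output shown above (the source does not show the inputs), and merge_on_key is a hypothetical wrapper name. An extra key dd is added to file1 to show that lines without a match in file2 are skipped.

```shell
#!/bin/sh
# merge_on_key: hypothetical wrapper; $1 is the lookup file, $2 the file that drives the output.
merge_on_key() {
    awk 'FNR==NR { a[$1] = $2; next }        # first file: remember column 2 by key
         ($1 in a) { print $1, a[$1], $2 }   # second file: print only keys seen before
        ' "$1" "$2"
}

tmp=$(mktemp -d)
# Assumed sample inputs; "dd" appears only in file1 and is therefore dropped.
printf 'aa 45\nbb 31\ncc 50\n' > "$tmp/file2"
printf 'aa 32\nbb 15\ncc 78\ndd 99\n' > "$tmp/file1"

merge_on_key "$tmp/file2" "$tmp/file1"
rm -rf "$tmp"
```

Note that the FNR==NR test relies on the first file being non-empty; with an empty first file the condition would also hold while reading the second one.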

AWK – Count Occurrences of Column Value in TSV File

There is no need to use cat to read the file. AWK is perfectly capable of reading it.

The core statement c[$3]++ counts the lines of each type (the value in the third field).
Then, at the end, just print all the counts as tab-separated values:

#!/bin/bash

awk -F '\t' '   {c[$3]++}
                 END{
                     for (i in c) printf("%s\t%s\n",i,c[i])
                 }' dataset.csv

Appended

Given the comment from the OP that:

I get some issues for columns that have quotes like that doesn\'t mean that you\'re not worth remembering think of the people who need to know they need to know so you need to show.... In this case the parsing on \t will fail.

I reviewed the answer and created this file:

$ cat dataset.csv 
1233    that doesn\'t mean that you\'re not worth remembering think of the people who need to know they need to know so you need to show...    CLASS_0
1234    here    CLASS_A
1235    goes the values CLASS_B
1236    "that need counting"    CLASS_B
1237    "\like \this"   CLASS_B
1238    \or \this       CLASS_C
1239    including spaces        CLASS_B
1240    but not tabs    CLASS_A
1241    which could not work    CLASS_B
1242    finally CLASS_C
1243    this is CLASS_A
1244    over    CLASS_B
1245    988     CLASS_C

That file, when used with the script, gives the correct result:

$ ./script
CLASS_A 3
CLASS_B 6
CLASS_C 3
CLASS_0 1

Each count matches the number of lines of that class in the file.
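
One caveat about the listing above: awk does not define the order in which for (i in c) visits the keys, so the lines may come out in a different order with another awk implementation. A small sketch that makes the order stable by sorting on the class name follows; count_classes is a name invented here, and the three sample lines are a subset of the test file.

```shell
#!/bin/sh
# count_classes: hypothetical helper; counts lines per value of the third tab-separated field.
count_classes() {
    awk -F '\t' '
        { c[$3]++ }
        END { for (i in c) printf "%s\t%s\n", i, c[i] }
    ' "$@" | sort
}

# Three lines taken from the test file, fed on standard input.
printf '1234\there\tCLASS_A\n1235\tgoes the values\tCLASS_B\n1240\tbut not tabs\tCLASS_A\n' | count_classes
```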

Of course, this assumes that the file has the correct number of tabs for 3 fields, and that variables are correctly quoted when expanded and are not in upper case.

To test that a file does comply with the first requirement, you may use this script:

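#!/bin/bash

# The file to check is the second argument; the first one is not used by this script.
filetoread="$2"

# Keep only tabs and newlines, then report every line that does not have exactly 2 tabs.
<"$filetoread" tr -dc '\t\n' |
    awk '(length!=2){printf("Error in line: %s, has %s tabs\n",NR,length)}'

# Report every line that awk does not split into exactly 3 tab-separated fields.
awk -F '\t' '(NF!=3){printf("Error in line: %s, has %s fields\n",NR,NF)}' "$filetoread"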

This checks that there are exactly two tabs per line, and that the number of fields (as seen by awk) is actually three.

Adding a couple of test lines:

… …
1239    including spaces        CLASS_B
1       but not     tabs    CLASS_A
2       but not \ttabs  CLASS_A
1240    but not tabs    CLASS_A
… …

And running the script above:

$ ./script 3 dataset.csv
Error in line: 8, has 4 tabs
Error in line: 8, has 5 fields

This detects the line with ID 1, which has four tabs (two added), and is not fooled by the line with ID 2, which contains a literal \t (a backslash followed by t) rather than a tab.
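
The invocation above passes 3 as a first argument, which the script as shown never reads. A possible variant that actually uses it as the expected number of fields might look like the following; check_fields and the awk variable want are names assumed for this sketch, and the two-line sample file is made up.

```shell
#!/bin/sh
# check_fields: hypothetical variant; $1 = expected field count, $2 = file to check.
check_fields() {
    awk -F '\t' -v want="$1" '
        NF != want { printf "Error in line: %s, has %s fields\n", NR, NF; bad = 1 }
        END { exit (bad ? 1 : 0) }   # non-zero exit status if any line was malformed
    ' "$2"
}

tmp=$(mktemp)
# Assumed sample: the second line has one tab too many.
printf '1\ta\tX\n2\tb\tc\tY\n' > "$tmp"
check_fields 3 "$tmp" || echo "file is malformed"
rm -f "$tmp"
```

Returning a non-zero exit status lets the check be used in a shell condition before running the counting script.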

As for the quoting and use of variables, that is something you should improve all by yourself.
