Calculate and divide by total with AWK

arithmetic awk

Given the following data file…

foo     10
bar     20
oof     50
rab     20

… how would I print column two as a percent of the total of column two? In other words, I want…

foo     10    10%
bar     20    20%
oof     50    50%
rab     20    20%

… with less obvious numbers of course. I can create a running total easily enough, but I'm not sure how to calculate the total before printing the lines. I am doing this in an awk file, totals.awk:

#!/usr/bin/awk -f
BEGIN{
        runningtotal=0
}
{
        runningtotal=runningtotal+$2
        print $1 "\t" $2 "\t" runningtotal "\t" $2/runningtotal
}

So, running ./totals.awk data yields…

foo     10      10      1
bar     20      30      0.666667
oof     50      80      0.625
rab     20      100     0.2

Is there a way to loop twice, once to calculate the total, and once to print the lines? Is this possible in AWK, or must I use other utilities?

Best Answer

To create the table with a single call to awk:

$ awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' data data
foo     10      10%
bar     20      20%
oof     50      50%
rab     20      20%

How it works

The file data is provided as an argument to awk twice. Consequently, it will be read twice, the first time to get the total, which is stored in the variable s, and the second to print the output. Looking at the commands in more detail:

  • FNR==NR{s+=$2;next;}

    NR is the total number of records (lines) that awk has read so far across all files, and FNR is the number of records read so far from the current file. Consequently, when FNR==NR, we are reading the first file. When this happens, the variable s is incremented by the value in the second column. Then, next tells awk to skip the rest of the commands and start over with the next record.

    Note that it is not necessary to initialize s to zero. In awk, an uninitialized variable evaluates to zero when used as a number (and to the empty string when used as a string).

  • printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s

    If we reach this command, then we are processing the second file. This means that s now holds the total of column 2. So, we print column 1, column 2, and the percentage, 100*$2/s.
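To see how NR and FNR behave when the same file is given twice, here is a minimal sketch. The two-line sample file and the variable names datafile and nr_fnr are made up for this demonstration, and it assumes mktemp is available on your system:

```shell
#!/bin/sh
# Hypothetical two-line sample file, created only for this demonstration.
datafile=$(mktemp)
printf 'foo\t10\nbar\t20\n' > "$datafile"

# For every record, print NR, FNR, and which pass the record belongs to.
# FNR restarts at 1 when awk opens the second copy of the file; NR keeps counting.
nr_fnr=$(awk '{ print NR, FNR, (FNR==NR ? "first pass" : "second pass") }' "$datafile" "$datafile")
printf '%s\n' "$nr_fnr"
# 1 1 first pass
# 2 2 first pass
# 3 1 second pass
# 4 2 second pass

rm -f "$datafile"
```

The two counters agree only during the first pass, which is why FNR==NR can be used to select it.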

Output format options

With printf, detailed control of the output format is possible. The command above uses the %s format specifier, which works for strings, integers, and floats. Three other options that might be useful here are:

  • %d formats numbers as integers. If the number is actually floating point, it will be truncated to an integer.

  • %f formats numbers as floating point. It is also possible to specify widths and decimal places as, for example, %5.2f.

  • %e provides exponential notation. This would be useful if some numbers were exceptionally large or small.
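As a quick illustration of these specifiers, the sketch below formats the same value, 12.5, four ways. The value and the variable name fmt_demo are arbitrary choices for the example, not part of the original command:

```shell
#!/bin/sh
# Format the same number with %d, %f, %5.2f, and %e, separated by "|".
fmt_demo=$(awk 'BEGIN { printf "%d|%f|%5.2f|%e\n", 12.5, 12.5, 12.5, 12.5 }')
printf '%s\n' "$fmt_demo"
# 12|12.500000|12.50|1.250000e+01
```

To apply one of these to the percentage column, replace the last %s in the printf format with, for example, %5.2f.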

Make a shell function

If you are going to use this more than once, it is inconvenient to type a long command each time. Instead, create either a function or a script to hold the command.

To create a function called totals, run the command:

$ totals() { awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' "$1" "$1"; }

With this function defined, the percentages for a data file called data can be found by running:

$ totals data

To make the definition of totals permanent, place it in your ~/.bashrc file.

Make a shell script

If you prefer a script, create a file called totals.sh with the contents:

#!/bin/sh
awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' "$1" "$1"

To get the percentages for a data file called data, run:

sh totals.sh data
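
Reading the file twice only works when the input is a regular file; it cannot be used with piped input. If you need that, one possible alternative (a different technique from the command above) is to store the rows in arrays and print them from an END block. This is a sketch only: the function name totals_stdin and the array names name and val are hypothetical, and it holds the whole input in memory:

```shell
#!/bin/sh
# Hypothetical single-pass variant: remember every row, then print from END.
totals_stdin() {
    awk '
        # Store columns 1 and 2 of each record and accumulate the total.
        { name[NR] = $1; val[NR] = $2; s += $2 }
        # After the last record, print each row with its share of the total.
        # The (s ? ... : 0) guard avoids dividing by zero when the total is 0.
        END {
            for (i = 1; i <= NR; i++)
                printf "%s\t%s\t%s%%\n", name[i], val[i], (s ? 100 * val[i] / s : 0)
        }
    ' "$@"
}

# Works with piped input as well as with a file name argument.
printf 'foo\t10\nbar\t20\noof\t50\nrab\t20\n' | totals_stdin
```

For very large inputs the two-pass command is likely the better choice, since it does not keep the data in memory.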