Shell – Diff Output of Two awk Commands

Tags: awk, diff(), io-redirection, shell

I'm trying to compute the difference between the output of two awk commands but my simple attempts at it seem to be failing. Here is what I'm trying:

diff $(awk '{print $3}' f1.txt | sort -u) $(awk '{print $2}' f2.txt | sort -u)

This doesn't work for reasons unknown to me. I was under the assumption that the $() construct was used to capture the output of another command, but my "diff" invocation fails to recognize the two inputs given to it. Is there any way I can make this work?

By the way, I can't use the obvious solution of writing the output of those two commands to separate files given that I'm logged on to a production box with no 'write' privileges.

Best Answer

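diff expects the names of two files as arguments, but $(...) substitutes each command's output into the command line, so diff treats the output words as file names rather than as file contents. One option is to write the two outputs to two files, then compare them: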

awk '{print $3}' f1.txt | sort -u > out1
awk '{print $2}' f2.txt | sort -u > out2
diff out1 out2

Or, using ksh93, bash, or zsh, you can use process substitution, which avoids creating the intermediate files yourself:

diff <(awk '{print $3}' f1.txt | sort -u) <(awk '{print $2}' f2.txt | sort -u)
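As a minimal sketch of the process substitution form, the snippet below runs the same pipeline on made-up input. The helper functions f1 and f2 and their sample rows are hypothetical stand-ins for f1.txt and f2.txt, used only so that nothing has to be written to disk. It assumes bash, since plain POSIX sh does not support <(...).

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for f1.txt and f2.txt.
f1() { printf '%s\n' 'a b apple' 'c d pear' 'e f apple'; }
f2() { printf '%s\n' 'x pear' 'y plum'; }

# Each <(...) is replaced by a file name (e.g. /dev/fd/63) that diff can open.
# diff exits with status 1 when the inputs differ, hence the "|| true".
result=$(diff <(f1 | awk '{print $3}' | sort -u) \
              <(f2 | awk '{print $2}' | sort -u) || true)

printf '%s\n' "$result"
```

Lines starting with < appear only in the first command's output, and lines starting with > appear only in the second's.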

Related Solution: Calculate and divide by total with AWK

How it works

The file data is provided as an argument to awk twice (the full command appears in the function and script sections below). Consequently, it will be read twice: the first time to get the total, which is stored in the variable s, and the second time to print the output. Looking at the commands in more detail:

FNR==NR{s+=$2;next;}

NR is the total number of records (lines) that awk has read and FNR is the number of records read so far from the current file. Consequently, when FNR==NR, we are reading the first file. When this happens, the variable s is incremented by the value in the second column. Then, next tells awk to skip the rest of the commands and start over with the next record.
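To see how the two counters behave, here is a small sketch that prints NR and FNR while awk reads the same file twice. The two-line sample file is hypothetical and is created with mktemp purely for the demonstration.

```shell
# Hypothetical two-line sample file, created only for this demonstration.
tmp=$(mktemp)
printf '%s\n' 'a 10' 'b 30' > "$tmp"

# NR keeps counting across files; FNR restarts at 1 for each file,
# so FNR==NR holds only while the first file is being read.
out=$(awk '{ print NR, FNR, (FNR==NR ? "first pass" : "second pass") }' "$tmp" "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

One caveat with this idiom: if the first file is empty, FNR==NR is also true while the second file is read, so the test no longer separates the two passes.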

Note that it is not necessary to initialize s to zero. In awk, an uninitialized variable evaluates to zero when used as a number (and to the empty string when used as a string).

printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s

If we reach this command, then we are processing the second file. This means that s now holds the total of column 2. So, we print column 1, column 2, and the percentage, 100*$2/s.
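Putting the two rules together, the following sketch runs the whole command on a small made-up data file. The file contents are hypothetical and were chosen so that column 2 sums to 100, which makes the percentages easy to check by eye.

```shell
# Hypothetical data file: a name in column 1 and a count in column 2.
data=$(mktemp)
printf '%s\n' 'apples 10' 'pears 30' 'plums 60' > "$data"

# First pass accumulates the column-2 total in s; second pass prints
# column 1, column 2, and the share of the total as a percentage.
out=$(awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' "$data" "$data")

printf '%s\n' "$out"
rm -f "$data"
```

Note that the command divides by s, so a file whose second column sums to zero would trigger a division-by-zero error from awk.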

Output format options

With printf, detailed control of the output format is possible. The command above uses the %s format specifier, which works for strings, integers, and floats. Three other options that might be useful here are:

%d formats numbers as integers. If the number is actually floating point, it will be truncated to an integer.
%f formats numbers as floating point. It is also possible to specify widths and decimal places as, for example, %5.2f.
%e provides exponential notation. This would be useful if some numbers were exceptionally large or small.
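The specifiers listed above can be compared side by side with a short sketch. The value 12.3456 is an arbitrary example, not taken from the original data.

```shell
# Format one arbitrary value with each specifier discussed above.
out=$(awk 'BEGIN {
  x = 12.3456
  printf "%d\n", x       # integer: the fractional part is dropped
  printf "%5.2f\n", x    # minimum width 5, two decimal places
  printf "%.1f%%\n", x   # one decimal place, followed by a literal %
  printf "%e\n", x       # exponential notation
}')

printf '%s\n' "$out"
```

In the percentage command, replacing the last %s with something like %.1f would give a fixed number of decimal places in the final column.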

Make a shell function

If you are going to use this more than once, typing the long command each time is inconvenient. Instead, create either a function or a script to hold the command.

To create a function called totals, run the command:

$ totals() { awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' "$1" "$1"; }

With this function defined, the percentages for a data file called data can be found by running:

$ totals data

To make the definition of totals permanent, place it in your ~/.bashrc file.
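Because totals reads its file twice, it cannot take input from a pipe. As a hedged alternative, and a different technique from the two-pass approach above, the sketch below buffers the rows in awk arrays and prints them in an END block. The function name totals_stdin and the sample input are hypothetical.

```shell
# Single-pass variant (an assumption-laden sketch, not the original method):
# store each row, accumulate the total, then print everything at the end.
totals_stdin() {
  awk '{ name[NR] = $1; val[NR] = $2; s += $2 }
       END { for (i = 1; i <= NR; i++)
               printf "%s\t%s\t%s%%\n", name[i], val[i], 100*val[i]/s }'
}

# Usage with piped input (hypothetical sample rows).
out=$(printf '%s\n' 'a 1' 'b 3' | totals_stdin)
printf '%s\n' "$out"
```

The trade-off is memory: every row is held until the end of input, whereas the two-pass version keeps only the running total.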

Make a shell script

If you prefer a script, create a file called totals.sh with the contents:

#!/bin/sh
awk 'FNR==NR{s+=$2;next;} {printf "%s\t%s\t%s%%\n",$1,$2,100*$2/s}' "$1" "$1"

To get the percentages for a data file called data, run:

sh totals.sh data
