Count unique associated values in awk (or perl)

awk csv-simple perl scripting

I've already found "How to print incremental count of occurrences of unique values in column 1", which is similar to my question, but the answer isn't sufficient for my purposes.

First let me just illustrate what I want to do:

# Example input
apple   abc jkl
apple   xyz jkl
apple   abc xyz
apple   qrs xyz
apple   abc jkl
banana  abc lmno
banana  lmnop   xyz
banana  lmnopq  jkl
banana  abc jkl
banana  lmnop   pqrs
banana  abcdefg tuv
cucumber    abc lmno
cucumber    abc jkl
cucumber    abc xyz
cucumber    abcd    jkl
cucumber    abc jkl
cucumber    abc lmno

# Desired output
apple   3   2
banana  4   5
cucumber    2   3

So, for each separate value of field 1, print that field, and a count of the unique associated values for field 2, and then for field 3.

The input is sorted by the first field, but sorting by other fields is disallowed (and would do no good as the 2nd and 3rd fields both need to be handled).

I'd much rather accomplish this in awk; it is probably far easier in perl and I'm interested in learning how to do that as well, but I'm dealing with an awk script and I'd rather not rewrite the whole thing.

I came up with one method which works, but is quite lengthy and seems very hacky to me. I'll post that as an answer (when I get back to the office) but would love to see any actually good approaches. (I don't think mine is "good".)

Best Answer

With awk:

awk 'function p(){print l,c,d; delete a; delete b; c=d=0} 
  NR!=1&&l!=$1{p()} ++a[$2]==1{c++} ++b[$3]==1{d++} {l=$1} END{p()}' file

Explanation:

function p(): defines a function p() that prints the current key l and the two counts c and d, then clears the arrays a and b and resets c and d to 0.
NR!=1&&l!=$1{p()} if it's not the first line and the variable l differs from the first field $1, a new block has started, so run the p() function for the previous block.
++a[$2]==1{c++} if incrementing the element of the a array with index $2 yields 1, that value is being seen for the first time in this block, so increment the c variable. The ++ before the element returns the new value, so the increment happens before the comparison with 1.
++b[$3]==1{d++} the same as above but with the 3rd field and the d variable.
{l=$1} sets l to the first field (for the comparison on the next line of input).
END{p()} after the last line is processed, awk has to print the values for the last block.
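
The same logic may be easier to follow when spread over several lines with longer names. This is only a sketch: the shell wrapper count_per_group and the names emit, seen2, seen3, n2, n3 and last are made up for illustration. It empties the arrays with split("", arr) instead of delete arr, on the assumption that the script might have to run on an awk without whole-array delete, and it guards END so that empty input prints nothing.

```shell
# Hypothetical wrapper; reads the named files, or stdin if none are given.
count_per_group() {
  awk '
    # Print the finished block, then reset all per-block state.
    function emit() {
      print last, n2, n3
      split("", seen2); split("", seen3)   # empty both arrays
      n2 = n3 = 0
    }
    NR > 1 && $1 != last { emit() }        # field 1 changed: block is done
    !seen2[$2]++ { n2++ }                  # first time $2 appears in block
    !seen3[$3]++ { n3++ }                  # first time $3 appears in block
    { last = $1 }                          # remember the key for next line
    END { if (NR > 0) emit() }             # flush the final block
  ' "$@"
}
```

It would be called as count_per_group file, or at the end of a pipeline.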

With your given input, the output is:

apple 3 2
banana 4 5
cucumber 2 3
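
A possible alternative, if resetting state at each block boundary is unwanted, is to key the "seen" arrays on the pair of field 1 and the other field. This is a sketch rather than the answer's method; the wrapper count_unique and the array names seen2, seen3, c2, c3 and order are hypothetical. It does not rely on the input being sorted, but it keeps every distinct pair in memory and prints only at the end.

```shell
# Hypothetical wrapper; reads the named files, or stdin if none are given.
count_unique() {
  awk '
    # Record each field 1 value in order of first appearance.
    # (c2[$1] is only created by the next rule, so this test runs first.)
    !($1 in c2) { order[++n] = $1 }
    # Count a (field 1, field 2) pair only the first time it is seen.
    !seen2[$1, $2]++ { c2[$1]++ }
    # Same for (field 1, field 3).
    !seen3[$1, $3]++ { c3[$1]++ }
    END {
      for (i = 1; i <= n; i++)
        print order[i], c2[order[i]], c3[order[i]]
    }
  ' "$@"
}
```

The trade-off is memory: the block-at-a-time version only ever holds one block's values, while this one holds all of them until END.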

Related Solutions

How to print incremental count of occurrences of unique values in column 1

The standard trick for this kind of problem in Awk is to use an associative counter array:

awk '{ print $0 "\t" ++count[$1] }'

This counts the number of times the first word in each line has been seen. It's not quite what you're asking for, since

Apple_1   1      300
Apple_2   1      500
Apple_1   500    1500

would produce

Apple_1   1      300     1
Apple_2   1      500     1
Apple_1   500    1500    2

(the count for Apple_1 isn't reset when we see Apple_2), but if the input is sorted you'll be OK.

Otherwise you'd need to track a counter and last-seen key:

awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 "\t" counter }'
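
To see how the two commands differ, both can be run on the unsorted sample above. The wrapper names running_total and run_length and the sample variable are made up for this sketch; the awk programs are the two shown above.

```shell
# Hypothetical wrappers around the two one-liners above.
running_total() { awk '{ print $0 "\t" ++count[$1] }' "$@"; }
run_length() {
  awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }
         print $0 "\t" counter }' "$@"
}

sample='Apple_1 1 300
Apple_2 1 500
Apple_1 500 1500'

# Counts every occurrence so far: the second Apple_1 line gets 2.
printf '%s\n' "$sample" | running_total
# Counts position within a consecutive run: the second Apple_1 line gets 1.
printf '%s\n' "$sample" | run_length
```

On input sorted by the first field, the two produce the same counts, because every occurrence of a key then falls in a single run.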

Bash – How to call bash function from within awk

You may be able to do what you want by piping awk's output into a while read loop. For example:

awk '/^#/ {next}; NF == 0 {next}; NF != 4 {exit 1} ; {print}' | 
    while read NAME METHOD URL TAG ; do
        :  # do stuff with $NAME, $METHOD, $URL, $TAG
        echo "$NAME:$METHOD:$URL:$TAG"
    done

if [ "${PIPESTATUS[0]}" -eq 1 ] ; then
    : # do something to handle awk's exit code
fi

Tested with:

$ cat input.txt 
# comment
NAME METHOD URL TAG
a b c d
1 2 3 4
x y z
a b c d

$ ./testawk.sh <input.txt 
NAME:METHOD:URL:TAG
a:b:c:d
1:2:3:4

Note that awk correctly exits at the fifth input line, x y z, which has only three fields, so the final a b c d line is never printed.
