Count unique associated values in awk (or perl)

awk csv-simple perl scripting

I've already found "How to print incremental count of occurrences of unique values in column 1", which is similar to my question, but the answer isn't sufficient for my purposes.

First let me just illustrate what I want to do:

# Example input
apple   abc jkl
apple   xyz jkl
apple   abc xyz
apple   qrs xyz
apple   abc jkl
banana  abc lmno
banana  lmnop   xyz
banana  lmnopq  jkl
banana  abc jkl
banana  lmnop   pqrs
banana  abcdefg tuv
cucumber    abc lmno
cucumber    abc jkl
cucumber    abc xyz
cucumber    abcd    jkl
cucumber    abc jkl
cucumber    abc lmno

# Desired output
apple   3   2
banana  4   5
cucumber    2   3

So, for each separate value of field 1, print that field, and a count of the unique associated values for field 2, and then for field 3.

The input is sorted by the first field, but sorting by other fields is disallowed (and would do no good as the 2nd and 3rd fields both need to be handled).

I'd much rather accomplish this in awk; it is probably far easier in perl and I'm interested in learning how to do that as well, but I'm dealing with an awk script and I'd rather not rewrite the whole thing.

I came up with one method which works, but is quite lengthy and seems very hacky to me. I'll post that as an answer (when I get back to the office) but would love to see any actually good approaches. (I don't think mine is "good".)

Best Answer

With awk:

awk 'function p(){print l,c,d; delete a; delete b; c=d=0} 
  NR!=1&&l!=$1{p()} ++a[$2]==1{c++} ++b[$3]==1{d++} {l=$1} END{p()}' file

Explanation:

  • function p(): defines a function called p(), which prints the values and deletes the used variables and arrays.
  • NR!=1&&l!=$1{p()} if it's not the first line and the variable l differs from the first field $1 (that is, a new group has started), then run the p() function to print the previous group.
  • ++a[$2]==1{c++} if incrementing the element of the a array with index $2 yields 1, then that value is being seen for the first time in the current group, so increment the c variable. The ++ before the element returns the new value, so the increment happens before the comparison with 1.
  • ++b[$3]==1{d++} the same as above but with the 3rd field and the d variable.
  • {l=$1} set l to the first field (for the l!=$1 comparison on the next line of input)
  • END{p()} after the last line is processed, awk has to print the values for the last block
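
The same steps can be written out as a longer, commented script. This is only a sketch of the one-liner above, not a different method: the names count_unique, flush_group, key, seen2, seen3, n2 and n3 are made up here for readability. It assumes whitespace-separated input that is already grouped by field 1. Two small deviations from the one-liner, both assumptions on my part about what is wanted: the arrays are emptied with split("", arr), which also works in awks that lack whole-array delete, and nothing is printed for empty input.

```shell
# count_unique is a hypothetical wrapper name; it reads records on stdin.
# Assumes the input is already grouped (sorted) by field 1.
count_unique() {
  awk '
    # Print the finished group, then reset the per-group state.
    function flush_group() {
      print key, n2, n3
      split("", seen2)   # portable way to empty an array
      split("", seen3)
      n2 = n3 = 0
    }
    NR > 1 && $1 != key { flush_group() }  # field 1 changed: previous group is done
    !seen2[$2]++ { n2++ }                  # field-2 value not yet seen in this group
    !seen3[$3]++ { n3++ }                  # same test for field 3
    { key = $1 }                           # remember the current group key
    END { if (NR > 0) flush_group() }      # emit the last group, unless input was empty
  '
}

# Small usage example (three records, two groups):
printf '%s\n' 'apple abc jkl' 'apple xyz jkl' 'banana abc jkl' | count_unique
# prints:
# apple 2 1
# banana 1 1
```

The test !seen2[$2]++ is just another spelling of ++a[$2]==1: both are true only the first time a given value appears in the current group.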

With your given input the output is:

apple 3 2
banana 4 5
cucumber 2 3