I've already found "How to print incremental count of occurrences of unique values in column 1", which is similar to my question, but the answer isn't sufficient for my purposes.
First let me just illustrate what I want to do:
# Example input
apple abc jkl
apple xyz jkl
apple abc xyz
apple qrs xyz
apple abc jkl
banana abc lmno
banana lmnop xyz
banana lmnopq jkl
banana abc jkl
banana lmnop pqrs
banana abcdefg tuv
cucumber abc lmno
cucumber abc jkl
cucumber abc xyz
cucumber abcd jkl
cucumber abc jkl
cucumber abc lmno
# Desired output
apple 3 2
banana 4 5
cucumber 2 3
So, for each separate value of field 1, print that field, and a count of the unique associated values for field 2, and then for field 3.
The input is sorted by the first field, but sorting by other fields is disallowed (and would do no good as the 2nd and 3rd fields both need to be handled).
I'd much rather accomplish this in awk; it is probably far easier in perl and I'm interested in learning how to do that as well, but I'm dealing with an awk script and I'd rather not rewrite the whole thing.
I came up with one method which works, but is quite lengthy and seems very hacky to me. I'll post that as an answer (when I get back to the office) but would love to see any actually good approaches. (I don't think mine is "good".)
Best Answer
With awk:
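The command itself did not survive in this copy of the answer. What follows is a minimal sketch pieced together from the fragments quoted in the explanation, not necessarily the answerer's exact text. The wrapper name count_unique_per_key and the body of p() (what it prints and how it resets state) are assumptions; the pattern/action pairs are taken from the explanation as written.

```shell
# Hypothetical wrapper (name is an assumption). Reads whitespace-separated
# records on stdin; records must already be grouped by field 1.
count_unique_per_key() {
  awk '
    # Print the finished group, then reset its state (body is assumed).
    function p() {
      print l, c, d
      delete a; delete b   # whole-array delete: works in gawk, mawk, BWK awk
      c = d = 0
    }
    NR != 1 && l != $1 { p() }   # field 1 changed: flush the previous group
    ++a[$2] == 1 { c++ }         # first time this field-2 value is seen
    ++b[$3] == 1 { d++ }         # first time this field-3 value is seen
    { l = $1 }                   # remember the current key for the next line
    END { p() }                  # flush the last group
  '
}
```

Called as count_unique_per_key < input.txt (input.txt being a placeholder for the example input from the question), this sketch prints apple 3 2, banana 4 5 and cucumber 2 3, one per line. It relies on the input being grouped by field 1, as the question states, and on completely empty input it still prints one blank-looking line from the END rule.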
Explanation:
function p(): defines a function called p(), which prints the values and deletes the used variables and arrays.
NR!=1&&l!=$1: if it's not the first line and the variable l does not equal the first field $1, then run the p() function.
++a[$2]==1{c++}: if the incremented value of the element of the a array with index $2 equals 1, then that value is being seen for the first time, so increment the c variable. The ++ before the element returns the new value, so the increment happens before the comparison with 1.
++b[$3]==1{d++}: the same as above, but with the 3rd field and the d variable.
{l=$1}: sets l to the first field (for the next iteration, see above).
END{p()}: after the last line is processed, awk has to print the values for the last block.
With your given input the output is: