How to print incremental count of occurrences of unique values in column 1

awk perl

I'm trying to come up with a solution to this problem: I need to incrementally count, and then print, the occurrences of each unique value in column 1 of a tab-delimited text file. Here is an example:

Apple_1   1      300
Apple_2   1      500
Apple_2   500    1500
Apple_2   1500   2450
Apple_3   1      1250
Apple_3   1250   2000

And the desired output is:

Apple_1   1      300     1
Apple_2   1      500     1
Apple_2   500    1500    2
Apple_2   1500   2450    3
Apple_3   1      1250    1
Apple_3   1250   2000    2

I know that I can print the line number in awk with just print NR, but I don't know how to reset it for each unique value of column 1.

Thanks for any help you can offer, I appreciate it.

Best Answer

The standard trick for this kind of problem in Awk is to use an associative counter array:

awk '{ print $0 "\t" ++count[$1] }'

This counts the number of times the first word in each line has been seen. It's not quite what you're asking for, since

Apple_1   1      300
Apple_2   1      500
Apple_1   500    1500

would produce

Apple_1   1      300     1
Apple_2   1      500     1
Apple_1   500    1500    2

(the count for Apple_1 isn't reset when we see Apple_2), but if the input is sorted you'll be OK.

Otherwise, to restart the count whenever column 1 changes, you'd need to track a counter and the last-seen key:

awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 "\t" counter }'
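
To make the difference concrete, here is a small sketch that runs both one-liners on the unsorted three-line example from above. The variable names input, per_key and per_run are only illustrative and are not part of the original answer.

```shell
# Sample from the answer: Apple_1 reappears after Apple_2 (tab-separated).
input=$(printf 'Apple_1\t1\t300\nApple_2\t1\t500\nApple_1\t500\t1500\n')

# Running total per key: the count for Apple_1 carries on across the gap.
per_key=$(printf '%s\n' "$input" | awk '{ print $0 "\t" ++count[$1] }')

# Run length: the counter restarts whenever column 1 changes.
per_run=$(printf '%s\n' "$input" |
    awk '{ if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 "\t" counter }')

printf '%s\n\n%s\n' "$per_key" "$per_run"
```

The first variant ends with 2 on the last line, the second with 1. On input that is already grouped by column 1, such as the example in the question, both should give the same result.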

Related Solutions

Count unique associated values in awk (or perl)

With awk:

awk 'function p(){print l,c,d; delete a; delete b; c=d=0} 
  NR!=1&&l!=$1{p()} ++a[$2]==1{c++} ++b[$3]==1{d++} {l=$1} END{p()}' file

Explanation:

function p(): defines a function called p(), which prints the values and deletes the used variables and arrays.
NR!=1&&l!=$1 if it's not the first line and the variable l differs from the first field $1, then run the p() function.
++a[$2]==1{c++} increments the element of the a array with index $2; if the result equals 1, that value is being seen for the first time, so the c variable is incremented. The ++ before the element returns the new value, so the increment happens before the comparison with 1.
++b[$3]==1{d++} the same as above but with the 3rd field and the d variable.
{l=$1} sets l to the first field (for the check on the next line, above).
END{p()} after the last line is processed, awk has to print the values for the last block
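
The same logic can be written out with longer names, which may be easier to follow. This is only a sketch: the four-line input is made up for illustration (the original question's data is not shown here), and flush_group, seen2, seen3, n2 and n3 are hypothetical names standing in for p, a, b, c and d. It also clears the arrays with split("", arr) instead of delete arr, an assumption made for portability to older awks.

```shell
result=$(printf 'apple red small\napple red large\napple green small\nbanana yellow long\n' |
awk '
# Print the finished group, then reset its state.
function flush_group() {
    print key, n2, n3
    split("", seen2)   # portable way to empty an array
    split("", seen3)
    n2 = n3 = 0
}
NR != 1 && key != $1 { flush_group() }   # column 1 changed: close the previous group
++seen2[$2] == 1     { n2++ }            # first time this column-2 value appears in the group
++seen3[$3] == 1     { n3++ }            # same for column 3
                     { key = $1 }        # remember the key for the next line
END                  { flush_group() }   # the last group has no following line to trigger it
')
printf '%s\n' "$result"
```

For this made-up input the script prints "apple 2 2" and "banana 1 1": apple has two distinct values in each of columns 2 and 3, banana has one of each.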

With your given input the output is:

apple 3 2
banana 4 5
cucumber 2 3

AWK – Count Occurrences of Column Value in TSV File

There is no need to use cat to read the file. AWK is perfectly capable of reading it.

A core c[$3]++ statement should get the count of lines of each type.
Then, at the end, just print (as tab separated values) all the counts:

#!/bin/bash

awk -F '\t' '   {c[$3]++}
                 END{
                     for (i in c) printf("%s\t%s\n",i,c[i])
                 }' dataset.csv
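
One thing worth knowing is that awk does not define the order in which for (i in c) visits the keys, so the lines may come out in any order. A minimal sketch, assuming a stable order is wanted, pipes the result through sort. The three inline rows are made up for illustration and the counts variable is just a convenient name.

```shell
counts=$(printf '1\tfoo\tCLASS_A\n2\tbar baz\tCLASS_B\n3\tqux\tCLASS_A\n' |
    awk -F '\t' '{ c[$3]++ } END { for (i in c) printf("%s\t%s\n", i, c[i]) }' |
    sort)
printf '%s\n' "$counts"
```

With this input, CLASS_A is reported with a count of 2 and CLASS_B with a count of 1, always in that order.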

Appended

Given the comment from the OP that:

I get some issues for columns that have quotes, like: that doesn\'t mean that you\'re not worth remembering think of the people who need to know they need to know so you need to show.... In this case the parsing on \t will fail.

I reviewed the answer and created this test file:

$ cat dataset.csv 
1233    that doesn\'t mean that you\'re not worth remembering think of the people who need to know they need to know so you need to show...    CLASS_0
1234    here    CLASS_A
1235    goes the values CLASS_B
1236    "that need counting"    CLASS_B
1237    "\like \this"   CLASS_B
1238    \or \this       CLASS_C
1239    including spaces        CLASS_B
1240    but not tabs    CLASS_A
1241    which could not work    CLASS_B
1242    finally CLASS_C
1243    this is CLASS_A
1244    over    CLASS_B
1245    988     CLASS_C

That file, when used with the script, gives the correct result:

$ ./script
CLASS_A 3
CLASS_B 6
CLASS_C 3
CLASS_0 1

Of course, this assumes that the file has the correct number of tabs for 3 fields, and that variables are correctly quoted when expanded and are not in upper case.

To test that a file does comply with the first requirement, you may use this script:

#!/bin/bash

filetoread="$2"

<"$filetoread" tr -dc '\t\n' |
    awk '(length!=2){printf("Error in line: %s, has %s tabs\n",NR,length)}'

awk -F '\t' '(NF!=3){printf("Error in line: %s, has %s fields\n",NR,NF)}' "$filetoread"

This checks that there are exactly two tabs per line, and that the number of fields (as seen by awk) is actually three.
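
As an alternative sketch, the tab count can also be taken inside awk itself, since gsub returns the number of matches it replaced. The check_tabs function and its expected-fields argument are hypothetical and not part of the script above, and the three inline test rows are made up.

```shell
# Hypothetical check: report lines whose tab count differs from (fields - 1).
check_tabs() {
    # $1: expected number of fields; input is read from stdin.
    awk -v want="$1" '{
        n = gsub(/\t/, "&")   # gsub returns how many tabs it matched
        if (n != want - 1)
            printf("Error in line: %s, has %s tabs\n", NR, n)
    }'
}

report=$(printf '1\ta b\tX\n2\ta\tb\tc\tX\n3\ta \\t b\tX\n' | check_tabs 3)
printf '%s\n' "$report"
```

Only the second row is reported, because it contains four real tabs. The third row contains a literal backslash followed by t, which is not a tab and so is not counted.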

Adding a couple of test lines:

… …
1239    including spaces        CLASS_B
1       but not     tabs    CLASS_A
2       but not \ttabs  CLASS_A
1240    but not tabs    CLASS_A
… …

And running the script above:

$ ./script 3 dataset.csv
Error in line: 8, has 4 tabs
Error in line: 8, has 5 fields

This detects the line with ID 1, which has four tabs (two added), and isn't fooled by the line with ID 2, which contains a literal \t.

As for the quoting and use of variables, that is something you should improve all by yourself.
