There is no need to use cat
to read the file. AWK is perfectly capable to read it.
A core c[$3]++
statement should get the count of lines of each type.
Then, at the end, just print (as tab separated values) all the counts:
#!/bin/bash
awk -F '\t' ' {c[$3]++}
END{
for (i in c) printf("%s\t%s\n",i,c[i])
}' dataset.csv
Appended
Given the comment from the OP that:
I get some issues for colums that have quotes like that doesn\'t mean that you\'re not worth remembering think of the people who need to know they need to know so you need to show....
In this case the parsing on \t will fail.
I got to review the answer. I created this file:
$ cat dataset.csv
1233 that doesn\'t mean that you\'re not worth remembering think of the people who need to know they need to know so you need to show... CLASS_0
1234 here CLASS_A
1235 goes the values CLASS_B
1236 "that need counting" CLASS_B
1237 "\like \this" CLASS_B
1238 \or \this CLASS_C
1239 including spaces CLASS_B
1240 but not tabs CLASS_A
1241 which could not work CLASS_B
1242 finally CLASS_C
1243 this is CLASS_A
1244 over CLASS_B
1245 988 CLASS_C
That file, when used with the script, gives the correct result:
$ ./script
CLASS_A 3
CLASS_B 6
CLASS_C 3
CLASS_0 1
Which is the correct result.
Of course, the file
- has the correct amount of
tabs
for 3 fields, and
- variables are correctly quoted when expanded and are not in upper case.
To test that a file does comply with the first requirement, you may use this script:
#!/bin/bash
filetoread="$2"
<"$filetoread" tr -dc '\t\n' |
awk '(length!=2){printf("Error in line: %s, has %s tabs\n",NR,length)}'
awk -F '\t' '(NF!=3){printf("Error in line: %s, has %s fields\n",NR,NF)}' "$filetoread"
Which checks that there are exactly two tabs per line, and
That the number of fields (as seen by awk) are actually three.
Adding a couple of test lines:
… …
1239 including spaces CLASS_B
1 but not tabs CLASS_A
2 but not \ttabs CLASS_A
1240 but not tabs CLASS_A
… …
And running the script above:
$ ./script 3 dataset.csv
Error in line: 8, has 4 tabs
Error in line: 8, has 5 fields
detects the line ID 1 that has four tabs (two added) and doesn't get fooled by line ID 2 with a \t
.
As for the quoting and use of variables, that is something you should improve all by yourself.
Best Answer
Using GNU sed:
As pointed out by Isaac, if we want to count exact 3 then do this :