Count by first column, count distinct by second column and group output by first column

Tags: awk, csv, text processing

I need a Unix command that will read a CSV file (over 700 million rows) with content like the sample below:

A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12

The command should count the number of occurrences of each entry in the first column, count the number of distinct values in column 2 for each of those entries, and group the output by the entries in column one, so that the output looks like this:

A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1 

Best Answer

To get the first two columns of the output:

$ cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }'
A,3
B,2
C,2
D,1

This extracts the first column of the original file, sorts it, and counts the occurrences of each entry. The awk at the end just swaps the two columns and inserts a comma between them.
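
For reference, the same two columns can also be produced in a single awk pass, keeping one counter per distinct first-column value in memory. This is just a sketch of the counting logic; with 700 million rows it is only attractive if the number of distinct keys is small, an assumption the sort-based pipeline above does not need to make:

$ awk -F, '{ n[$1]++ } END { for (k in n) print k "," n[k] }' file | sort
A,3
B,2
C,2
D,1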

The final column can be obtained with

$ sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }'
2
2
1
1

This sorts the original data and discards the duplicate lines, so each remaining line is a distinct (column 1, column 2) pair. The first column is then extracted and the occurrences of each entry are counted, which gives the number of distinct column 2 values per column 1 entry. The awk at the end extracts only the counts.
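
The same distinct count can be sketched in awk as well, by remembering each line (i.e. each column 1/column 2 pair) the first time it is seen. This is shown only to illustrate what the sort -u pipeline computes; it keeps every distinct pair in memory and prints the key alongside the count:

$ awk -F, '!seen[$0]++ { d[$1]++ } END { for (k in d) print k "," d[k] }' file | sort
A,2
B,2
C,1
D,1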

Combining these using bash and paste:

$ paste -d, <( cut -d, -f1 <file | sort    | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
            <( sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1

If you pre-sort the data, this may be shortened slightly (and sped up considerably):

$ sort -o file file

$ paste -d, <( cut -d, -f1 <file        | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
            <( uniq <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1
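
For completeness, both counts can also be computed in a single awk pass over the unsorted file. This is a sketch that trades the disk-based sort for memory: it holds one array entry per distinct (column 1, column 2) pair, so it is only suitable when the number of distinct pairs is manageable:

$ awk -F, '{ n[$1]++ }                       # occurrences of each first-column value
           !seen[$1 SUBSEP $2]++ { d[$1]++ } # first time this (col1, col2) pair is seen
           END { for (k in n) print k "," n[k] "," d[k] }' file | sort
A,3,2
B,2,2
C,2,1
D,1,1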