Count by first column, count distinct by second column and group output by first column

Tags: awk, csv, text processing

I need a Unix command that will read a CSV file (over 700 million rows) with content like the sample below:

A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12

The command should count the number of occurrences of each entry in the first column, count the number of distinct values in column 2 for each of those entries, and group the output by the entries in column one, so that the output looks like this:

A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1 

Best Answer

To get the first two columns of the output:

$ cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }'
A,3
B,2
C,2
D,1

This extracts the first column of the original file, sorts it, and counts the occurrences of each entry. The awk at the end just swaps the two columns and inserts a comma between them.
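
For reference, the same two columns can also be produced in a single awk pass, keeping one counter per distinct first-column value in memory. This is just a sketch of the counting logic; with 700 million rows it is only attractive if the number of distinct keys is small, an assumption the sort-based pipeline above does not need to make:

$ awk -F, '{ n[$1]++ } END { for (k in n) print k "," n[k] }' file | sort
A,3
B,2
C,2
D,1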

The final column can be obtained with

$ sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }'
2
2
1
1

This sorts the original data and discards the duplicate lines, so each remaining line is a distinct (column 1, column 2) pair. The first column is then extracted and the occurrences of each entry are counted, which gives the number of distinct column 2 values per column 1 entry. The awk at the end extracts only the counts.
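
The same distinct count can be sketched in awk as well, by remembering each line (i.e. each column 1/column 2 pair) the first time it is seen. This is shown only to illustrate what the sort -u pipeline computes; it keeps every distinct pair in memory and prints the key alongside the count:

$ awk -F, '!seen[$0]++ { d[$1]++ } END { for (k in d) print k "," d[k] }' file | sort
A,2
B,2
C,1
D,1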

Combining these using bash and paste:

$ paste -d, <( cut -d, -f1 <file | sort    | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
            <( sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1

If you pre-sort the data, this may be shortened slightly (and sped up considerably):

$ sort -o file file

$ paste -d, <( cut -d, -f1 <file        | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
            <( uniq <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1
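
For completeness, both counts can also be computed in a single awk pass over the unsorted file. This is a sketch that trades the disk-based sort for memory: it holds one array entry per distinct (column 1, column 2) pair, so it is only suitable when the number of distinct pairs is manageable:

$ awk -F, '{ n[$1]++ }                       # occurrences of each first-column value
           !seen[$1 SUBSEP $2]++ { d[$1]++ } # first time this (col1, col2) pair is seen
           END { for (k in n) print k "," n[k] "," d[k] }' file | sort
A,3,2
B,2,2
C,2,1
D,1,1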