I need a Unix command that will read a CSV file (over 700M rows) with content like the sample below:
A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12
The command should count the number of occurrences of each entry in the first column, then count the number of distinct values in column 2 for each of those entries, grouping the output by column one, so that the output looks like this:
A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1
Best Answer
To get the first two columns of the output:
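One way to do this, assuming the sample from the question is saved as file.csv (the snippet writes that file first so it runs on its own):

```shell
# Sample data from the question, saved so the snippet is self-contained.
printf '%s\n' 'A, 10' 'B, 11' 'C, 12' 'A, 10' 'B, 12' 'D, 10' 'A, 12' 'C, 12' > file.csv

# Take column 1, sort it, count repeats of each value, then let awk
# swap the count and the key and put a comma between them.
cut -d, -f1 file.csv | sort | uniq -c | awk '{print $2", "$1}'
# ->
# A, 3
# B, 2
# C, 2
# D, 1
```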
This extracts the first column of the original file, sorts it, and counts the number of duplicated entries; the awk at the end just swaps the columns and inserts a comma between them. The final column may be had with:
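A pipeline matching that description, again assuming the sample is saved as file.csv, might be:

```shell
# Sample data from the question, saved so the snippet is self-contained.
printf '%s\n' 'A, 10' 'B, 11' 'C, 12' 'A, 10' 'B, 12' 'D, 10' 'A, 12' 'C, 12' > file.csv

# Drop duplicate rows, then count how many distinct rows remain per key;
# the awk keeps only the counts.
sort -u file.csv | cut -d, -f1 | uniq -c | awk '{print $1}'
# ->
# 2
# 2
# 1
# 1
```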
This sorts the original data and discards the duplicate rows. Then the first column is extracted and the number of duplicates of each value is counted. The awk at the end extracts the counts only. Combining these using bash and paste:
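A sketch of the combination, using bash process substitution to feed both pipelines into paste (the file name file.csv is still an assumption; both sides emit one line per key in the same sorted order, which is what makes the paste alignment valid):

```shell
# Sample data from the question, saved so the snippet is self-contained.
printf '%s\n' 'A, 10' 'B, 11' 'C, 12' 'A, 10' 'B, 12' 'D, 10' 'A, 12' 'C, 12' > file.csv

# Left side: "key, total-count"; right side: distinct-count per key,
# with a leading space so the joined line reads "A, 3, 2".
paste -d, \
  <(cut -d, -f1 file.csv | sort | uniq -c | awk '{print $2", "$1}') \
  <(sort -u file.csv | cut -d, -f1 | uniq -c | awk '{print " "$1}')
# ->
# A, 3, 2
# B, 2, 2
# C, 2, 1
# D, 1, 1
```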
If you pre-sort the data, this may be shortened slightly (and sped up considerably):
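With the data sorted once up front, both inner sorts disappear: uniq alone removes duplicate rows, since duplicates are now adjacent. A sketch under the same file-name assumptions:

```shell
# Sample data from the question, saved so the snippet is self-contained.
printf '%s\n' 'A, 10' 'B, 11' 'C, 12' 'A, 10' 'B, 12' 'D, 10' 'A, 12' 'C, 12' > file.csv

# Sort once; for 700M rows this is the dominant cost, paid a single time.
sort file.csv > sorted.csv

# Same two pipelines as before, minus the per-pipeline sorts.
paste -d, \
  <(cut -d, -f1 sorted.csv | uniq -c | awk '{print $2", "$1}') \
  <(uniq sorted.csv | cut -d, -f1 | uniq -c | awk '{print " "$1}')
# ->
# A, 3, 2
# B, 2, 2
# C, 2, 1
# D, 1, 1
```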