Text Processing – How to Find the Most Frequent Word in a CSV File Ignoring Duplicates

sort, text processing, uniq

I need to find the 10 most frequent words in a .csv file.
The file is structured so that each line contains comma-separated words. If the same word is repeated more than once in the same line, it should be counted as one.
So, in the example below:

green,blue,blue,yellow,red,yellow
red,blue,green,green,green,brown

green, blue, and red should each be counted as 2, and yellow and brown as 1.

I know similar questions have been asked before, and one solution was:

<file.csv tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10

But this counts the total number of times a word appears, including repeats within the same line, like this:

  4 green
  3 blue
  2 yellow
  2 red
  1 brown

and this is not actually what I need.
Any help? I would also appreciate a short explanation of the command, and of why the command I found in similar questions does not do what I need.

Best Answer

I would probably reach for perl. (As to why your command over-counts: tr flattens the whole file into one word per line before anything is counted, so the information about which line each word came from is lost, and repeats within a line get counted separately.)

  • Use uniq from the List::Util module to de-duplicate each row.
  • Use a hash to count the resulting occurrences.

For example

perl -MList::Util=uniq -F, -lnE '
  # -F, autosplits each line on commas into @F; uniq drops
  # repeats within the line before the hash counts them
  map { $h{$_}++ } uniq @F
  }{   # this block runs once, after the last line is read
  foreach $k (sort { $h{$b} <=> $h{$a} } keys %h) {say "$h{$k}: $k"}
' file.csv
2: red
2: green
2: blue
1: yellow
1: brown
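
If perl is not an option, the same algorithm fits in a single awk pass. A sketch (note that delete on a whole array is a gawk/mawk extension rather than strict POSIX awk):

awk -F, '{
    delete seen                 # reset the per-line "already seen" set
    for (i = 1; i <= NF; i++)   # walk the comma-separated fields
        if (!seen[$i]++)        # count each word at most once per line
            count[$i]++
} END {
    for (w in count) print count[w], w
}' file.csv | sort -rn | head -10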

If you have no option except the sort and uniq coreutils, you can implement a similar algorithm with the addition of a shell loop:

while IFS=, read -ra words; do            # -r stops backslash mangling
  printf '%s\n' "${words[@]}" | sort -u   # this line's unique words only
done < file.csv | sort | uniq -c | sort -rn
  2 red
  2 green
  2 blue
  1 yellow
  1 brown

However, please refer to Why is using a shell loop to process text considered bad practice?
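
If the loop itself is the objection, a loop-free middle ground is to tag each word with its line number, de-duplicate the (line number, word) pairs, and only then count. Again just a sketch along the same lines:

awk -F, '{ for (i = 1; i <= NF; i++) print NR, $i }' file.csv |
  sort -u |          # keep one copy of each (line number, word) pair
  cut -d' ' -f2- |   # drop the line number again
  sort | uniq -c | sort -rn | head -10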
