Text Processing – How to Find the Most Frequent Word in a CSV File Ignoring Duplicates

sort, text processing, uniq

I need to find the 10 most frequent words in a .csv file.
The file is structured so that each line contains comma-separated words. If the same word is repeated more than once in the same line, it should be counted as one.
So, in the example below:

green,blue,blue,yellow,red,yellow
red,blue,green,green,green,brown

green, blue, and red should each be counted as 2, and yellow and brown as 1.

I know similar questions have been asked before, and one solution was:

<file.csv tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10

But this counts the total number of times a word appears, including repeats within the same line, like this:

  4 green
  3 blue
  2 yellow
  2 red
  1 brown

and this is not actually what I need.
Any help? I would also appreciate a short explanation of the command, and of why the command I found in similar questions does not do what I need.

Best Answer

I would probably reach for perl. (As to why your command over-counts: tr flattens the whole file into one word per line before anything is counted, so the information about which line each word came from is lost, and repeats within a line get counted separately.)

  • Use uniq from the List::Util module to de-duplicate each row.
  • Use a hash to count the resulting occurrences.

For example

perl -MList::Util=uniq -F, -lnE '
  # -F, autosplits each line on commas into @F; uniq drops
  # repeats within the line before the hash counts them
  map { $h{$_}++ } uniq @F
  }{   # this block runs once, after the last line is read
  foreach $k (sort { $h{$b} <=> $h{$a} } keys %h) {say "$h{$k}: $k"}
' file.csv
2: red
2: green
2: blue
1: yellow
1: brown
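
If perl is not an option, the same algorithm fits in a single awk pass. A sketch (note that delete on a whole array is a gawk/mawk extension rather than strict POSIX awk):

awk -F, '{
    delete seen                 # reset the per-line "already seen" set
    for (i = 1; i <= NF; i++)   # walk the comma-separated fields
        if (!seen[$i]++)        # count each word at most once per line
            count[$i]++
} END {
    for (w in count) print count[w], w
}' file.csv | sort -rn | head -10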

If you have no option except the sort and uniq coreutils, you can implement a similar algorithm with the addition of a shell loop:

while IFS=, read -ra words; do            # -r stops backslash mangling
  printf '%s\n' "${words[@]}" | sort -u   # this line's unique words only
done < file.csv | sort | uniq -c | sort -rn
  2 red
  2 green
  2 blue
  1 yellow
  1 brown

However, please refer to Why is using a shell loop to process text considered bad practice?
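
If the loop itself is the objection, a loop-free middle ground is to tag each word with its line number, de-duplicate the (line number, word) pairs, and only then count. Again just a sketch along the same lines:

awk -F, '{ for (i = 1; i <= NF; i++) print NR, $i }' file.csv |
  sort -u |          # keep one copy of each (line number, word) pair
  cut -d' ' -f2- |   # drop the line number again
  sort | uniq -c | sort -rn | head -10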
