Command Line – Find Most Frequent Words in a File with Stop Words List

command line, text processing

I want to find the most frequent words in a text file while excluding the words in a stop-word list. I already have this code:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head  -10 > test.txt

This is from an old post, but my output file contains something like this:

240 
 21 ipsum
 20 Lorem
 11 Textes
 9 Blindtexte
 7 Text
 5 F
 5 Blindtext
 4 Texte
 4 Buchstaben

The first entry is just a blank; in the text these are punctuation marks (such as periods). I don't want them counted, so what do I have to add?

Best Answer

Consider this test file:

$ cat text.txt
this file has "many" words, some
with punctuation.  some repeat,
many do not.

To get a word count:

$ grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 this
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

How it works

  • grep -oE '[[:alpha:]]+' text.txt

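    This returns all words, minus any spaces or punctuation, with one word per line. The -o option makes grep print only the matched text, and [[:alpha:]]+ matches each run of letters.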

  • sort

    This sorts the words into alphabetical order.

  • uniq -c

    This counts the number of times each word occurs. (For uniq to work, its input must be sorted.)

  • sort -nr

    This sorts the output numerically (-n) and in reverse order (-r) so that the most frequent word is at the top.
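
As a rough sketch, the four stages above can be wrapped in a small shell function. The function name word_freq and the temporary sample file are only illustrative and are not part of the original commands; head -n 3 is added here to show one way of keeping just the top few entries.

```shell
# Hypothetical helper: count word frequencies in the file given as $1.
word_freq() {
    grep -oE '[[:alpha:]]+' "$1" |   # one word per line, punctuation dropped
        sort |                       # bring identical words together
        uniq -c |                    # prefix each distinct word with its count
        sort -nr                     # highest counts first
}

# Recreate the sample file in a temporary directory (illustrative path).
tmpdir=$(mktemp -d)
printf '%s\n' 'this file has "many" words, some' \
              'with punctuation.  some repeat,' \
              'many do not.' > "$tmpdir/text.txt"

# Show only the three most frequent words.
word_freq "$tmpdir/text.txt" | head -n 3
```

Changing the number passed to head -n changes how many entries are kept.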

Handling mixed case

Consider this mixed-case test file:

$ cat Text.txt
This file has "many" words, some
with punctuation.  Some repeat,
many do not.

If we want to count some and Some as the same:

$ grep -oE '[[:alpha:]]+' Text.txt | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 This
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

Here, we added the -f option to sort so that it would ignore case and the -i option to uniq so that it also would ignore case.
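
An alternative, sketched here under the assumption that it is acceptable for every reported word to be lower case, is to fold the text to lower case with tr before counting. Plain sort and uniq -c are then enough. The function name word_freq_ci and the temporary file are made up for this example.

```shell
# Hypothetical helper: case-insensitive word count by lower-casing first.
word_freq_ci() {
    grep -oE '[[:alpha:]]+' "$1" |
        tr '[:upper:]' '[:lower:]' |   # "Some" and "some" both become "some"
        sort | uniq -c | sort -nr
}

# Recreate the mixed-case sample file in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'This file has "many" words, some' \
              'with punctuation.  Some repeat,' \
              'many do not.' > "$tmpdir/Text.txt"

word_freq_ci "$tmpdir/Text.txt"
```

Unlike sort -f | uniq -ic, which prints one of the original spellings, this variant always reports the lower-case form, so This appears as this.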

Excluding stop words

Suppose that we want to exclude these stop words from the count:

$ cat stopwords 
with
not
has
do

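So, we add grep -v to eliminate these words. Here -w matches whole words only, -F treats each line of the stop-word file as a fixed string, and -f stopwords reads the patterns from that file: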

$ grep -oE '[[:alpha:]]+' Text.txt | grep -vwFf stopwords | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 This
      1 repeat
      1 punctuation
      1 file
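
Note that the stop-word filter above is case-sensitive, so a stop word this would not remove This. A minimal sketch of one way to handle that, assuming the stop words should match regardless of case, is to add -i to the filtering grep. It also uses head to keep only the top N entries, as the question asked for. The function name top_words, its argument order, and the default of 10 are assumptions made for this example.

```shell
# Hypothetical helper: top_words FILE STOPWORDS [N]
# Prints the N most frequent words in FILE (default 10), ignoring case
# and skipping every word listed in STOPWORDS (one per line).
top_words() {
    grep -oE '[[:alpha:]]+' "$1" |
        grep -viwFf "$2" |             # -i: drop stop words in any case
        sort -f | uniq -ic | sort -nr |
        head -n "${3:-10}"
}

# Recreate the sample input and a stop-word list in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'This file has "many" words, some' \
              'with punctuation.  Some repeat,' \
              'many do not.' > "$tmpdir/Text.txt"
printf '%s\n' with not has do this > "$tmpdir/stopwords"

top_words "$tmpdir/Text.txt" "$tmpdir/stopwords" 5
```

With this stop-word list, This is filtered out even though the list spells it in lower case.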