Command Line – Find Most Frequent Words in a File with Stop Words List

command line, text processing

I want to find the most frequent words in a text file while excluding the words in a stop word list. I already have this code:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head  -10 > test.txt

(taken from an old post), but my output file contains something like this:

240 
 21 ipsum
 20 Lorem
 11 Textes
 9 Blindtexte
 7 Text
 5 F
 5 Blindtext
 4 Texte
 4 Buchstaben

The first entry is just whitespace; in the text these are punctuation marks (such as periods). I don't want them counted, so what do I have to add?

Best Answer

Consider this test file:

$ cat text.txt
this file has "many" words, some
with punctuation.  some repeat,
many do not.

To get a word count:

$ grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 this
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

How it works

grep -oE '[[:alpha:]]+' text.txt

This returns all words, minus any spaces or punctuation, with one word per line.

sort

This sorts the words into alphabetical order.

uniq -c

This counts the number of times each word occurs. (For uniq to work, its input must be sorted.)

sort -nr

This sorts the output numerically so that the most frequent word is at the top.
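
As a rough sketch, the four stages above could be wrapped in a small shell function. The name word_freq and the optional top-N argument are illustrative assumptions, not part of the answer:

```shell
#!/bin/sh
# word_freq FILE [N]: print the N most frequent words in FILE (default 10).
# The name word_freq and the optional N argument are illustrative assumptions.
word_freq() {
    grep -oE '[[:alpha:]]+' "$1" |   # one word per line, punctuation dropped
        sort |                       # put identical words next to each other
        uniq -c |                    # count each run of identical lines
        sort -nr |                   # highest count first
        head -n "${2:-10}"           # keep only the top N lines
}

# Demo on the sample text from above, written to a temporary file.
demo=$(mktemp)
printf '%s\n' 'this file has "many" words, some' \
    'with punctuation.  some repeat,' \
    'many do not.' > "$demo"
word_freq "$demo" 2    # prints the two most frequent words with their counts
rm -f "$demo"
```

Passing the file as an argument keeps the pipeline reusable; the exact column width of the counts depends on the uniq implementation.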

Handling mixed case

Consider this mixed-case test file:

$ cat Text.txt
This file has "many" words, some
with punctuation.  Some repeat,
many do not.

If we want to count some and Some as the same:

$ grep -oE '[[:alpha:]]+' Text.txt | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 This
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

Here, we added the -f option to sort so that it would ignore case and the -i option to uniq so that it also would ignore case.
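
A different approach, sketched here only as an alternative to the -f and -i options, is to fold everything to lower case before counting, so that the output is uniformly lower case. The function name is made up for the example. One caveat: GNU tr works byte by byte, so non-ASCII capitals (for example in German text) may not be folded.

```shell
#!/bin/sh
# word_freq_nocase FILE: case-insensitive word count, all output in lower case.
# The function name is hypothetical; this is an alternative to sort -f | uniq -ic.
word_freq_nocase() {
    grep -oE '[[:alpha:]]+' "$1" |
        tr '[:upper:]' '[:lower:]' |   # "Some" and "some" become the same line
        sort | uniq -c | sort -nr
}

# Demo on the mixed-case sample text from above.
demo=$(mktemp)
printf '%s\n' 'This file has "many" words, some' \
    'with punctuation.  Some repeat,' \
    'many do not.' > "$demo"
word_freq_nocase "$demo"
rm -f "$demo"
```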

Excluding stop words

Suppose that we want to exclude these stop words from the count:

$ cat stopwords 
with
not
has
do

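So, we add grep -v to eliminate these words (-w matches whole words only, -F treats the patterns as fixed strings, and -f reads the patterns from the file stopwords):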

$ grep -oE '[[:alpha:]]+' Text.txt | grep -vwFf stopwords | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 This
      1 repeat
      1 punctuation
      1 file
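
Note that the filter above compares case-sensitively, so a stop word at the start of a sentence (such as "With") would survive. A minimal sketch of a variant that also ignores case in the stop word filter follows; the function name and argument order are assumptions made for the example:

```shell
#!/bin/sh
# word_freq_stop TEXTFILE STOPFILE: count words, skipping stop words in any case.
# The function name and its two arguments are hypothetical.
word_freq_stop() {
    grep -oE '[[:alpha:]]+' "$1" |
        # -v invert the match, -i ignore case, -w whole words only,
        # -F fixed strings (no regex), -f read the patterns from a file
        grep -viwFf "$2" |
        sort -f | uniq -ic | sort -nr
}

# Demo: "With" and "with" are both removed by the single stop word "with".
text=$(mktemp)
stop=$(mktemp)
printf '%s\n' 'With some words, with Some' > "$text"
printf '%s\n' 'with' > "$stop"
word_freq_stop "$text" "$stop"
rm -f "$text" "$stop"
```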

Related Solutions

Command Line – Find the Most Frequent Words in a File

That's pretty much the most common way of finding "N most common things", except you're missing a sort, and you've got a gratuitous cat:

tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head  -10

If you don't put in a sort before the uniq -c you'll probably get a lot of false singleton words. uniq only collapses runs of adjacent identical lines; it does not check for uniqueness across the whole input.
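
A tiny demonstration of why the sort matters, using a made-up three-line input:

```shell
#!/bin/sh
# Unsorted: the two "a" lines are not adjacent, so uniq reports them separately
# (three output lines, each with a count of 1).
printf '%s\n' a b a | uniq -c

# Sorted first: identical lines become adjacent and are counted together
# (two output lines: "a" with count 2 and "b" with count 1).
printf '%s\n' a b a | sort | uniq -c
```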

EDIT: I forgot a trick, "stop words". If you're looking at English text (sorry, monolingual North American here), words like "of", "and", "the" almost always take the top two or three places. You probably want to eliminate them. The GNU Groff distribution has a file named eign in it which contains a pretty decent list of stop words. My Arch distro has /usr/share/groff/current/eign, but I think I've also seen /usr/share/dict/eign or /usr/dict/eign in old Unixes.

You can use stop words like this:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head  -10

My guess is that most human languages need similar "stop words" removed to get meaningful word frequency counts, but I don't know where to suggest getting stop word lists for other languages.

EDIT: fgrep should use the -w option, which enables whole-word matching. This avoids false positives on words that merely contain short stop words, like "a" or "i".
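
One thing to watch with the tr form: every non-alphanumeric character becomes a newline, so two such characters in a row (a period followed by a space, say) leave an empty line, and those empty lines are counted like a word. That is the large blank entry in the output shown in the question at the top. A possible fix, sketched here with a hypothetical function name and with the stop word file passed as an argument, is to squeeze repeated newlines with -s and drop whatever empty line remains:

```shell
#!/bin/sh
# top_words TEXTFILE STOPFILE [N]: N most frequent words, without a blank entry.
# The function name and argument order are assumptions made for this sketch.
top_words() {
    tr -cs '[:alnum:]' '[\n*]' < "$1" |  # -s squeezes each run of newlines into one
        grep -v '^$' |                   # drop the empty line left by leading punctuation
        grep -v -w -F -f "$2" |          # remove stop words (whole-word, fixed strings)
        sort | uniq -c | sort -nr |
        head -n "${3:-10}"
}

# Demo with a made-up sentence and a one-word stop list.
text=$(mktemp)
stop=$(mktemp)
printf '%s\n' '"Hello," said the fox. the fox... ran!' > "$text"
printf '%s\n' 'the' > "$stop"
top_words "$text" "$stop" 3
rm -f "$text" "$stop"
```

On a system that ships the groff list, the second argument could be /usr/share/groff/current/eign as in the command above.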

What’s the easiest way to make a list of most common words in a list

For quick coding, this is what I currently use successfully:

printf '%s\n' $(cat *.txt) | sort | uniq -c | sort -gr | less
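
A caveat with this form: the unquoted $(cat *.txt) is split on whitespace and then also undergoes filename expansion, so a token such as * in the text would be replaced by file names. A streaming variant that avoids this, given only as a sketch with a hypothetical function name, splits on whitespace with tr instead:

```shell
#!/bin/sh
# count_tokens FILE...: count whitespace-separated tokens across the given files.
# The function name is hypothetical; tokens keep their punctuation, as in the
# printf version above.
count_tokens() {
    cat -- "$@" |
        tr -s '[:space:]' '[\n*]' |  # every run of whitespace becomes one newline
        grep -v '^$' |               # drop an empty first line, if any
        sort | uniq -c | sort -nr
}

# Demo with two small files in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' 'a b' '' 'a *' > "$dir/x.txt"
printf '%s\n' 'b a' > "$dir/y.txt"
count_tokens "$dir"/*.txt
rm -rf "$dir"
```

For interactive use the result can be piped into less, for example count_tokens *.txt | less.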