I want to find the most frequent words in a text file, using a stop-words list. I already have this code:
tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head -10 > count.txt
from an old post, but my output contains something like this:
240
21 ipsum
20 Lorem
11 Textes
9 Blindtexte
7 Text
5 F
5 Blindtext
4 Texte
4 Buchstaben
The first entry is just a space; in the text these are punctuation marks (like periods), but I don't want this, so what do I have to add?
Best Answer
Consider this test file:
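The answer's original sample file did not survive here, so the contents below are only a stand-in; any small text file works:

```shell
# Stand-in sample file; the answer's original contents are not preserved
cat > text.txt <<'EOF'
the quick brown fox jumped over the lazy dog
the lazy dog was not amused
EOF
```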
To get a word count:
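The command itself was lost in extraction, but it can be reassembled from the step-by-step explanation that follows:

```shell
# Reconstructed from the steps explained below; text.txt is the test file
# from above (a stand-in is created here so the snippet runs on its own)
[ -f text.txt ] || printf 'the quick brown fox\nthe lazy dog\n' > text.txt

# Extract words, sort them, count duplicates, sort by count descending
grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
```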
How it works
grep -oE '[[:alpha:]]+' text.txt
This returns all words, minus any spaces or punctuation, with one word per line.
sort
This sorts the words into alphabetical order.
uniq -c
This counts the number of times each word occurs. (For uniq to work, its input must be sorted.)
sort -nr
This sorts the output numerically so that the most frequent word is at the top.
Handling mixed case
Consider this mixed-case test file:
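The mixed-case sample was also lost in extraction; a stand-in (contents assumed) that contains both some and Some:

```shell
# Hypothetical mixed-case sample; the answer's original file is not preserved
cat > text.txt <<'EOF'
Some like it hot and some like it cold
EOF
```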
If we want to count some and Some as the same:
grep -oE '[[:alpha:]]+' text.txt | sort -f | uniq -ic | sort -nr
Here, we added the -f option to sort so that it would ignore case, and the -i option to uniq so that it also would ignore case.
Excluding stop words
Suppose that we want to exclude these stop words from the count:
So, we add grep -v to eliminate these words:
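A reconstructed final pipeline. The stop-word list contents below are assumed (the answer's original list was not preserved); the grep options are -v (invert match), -w (whole words), -F (fixed strings), -i (ignore case), and -f (read patterns from a file):

```shell
# Example stop-word list; the answer's actual list is not preserved
printf '%s\n' the a and it > stopwords

# Stand-in sample text, so the snippet runs on its own
printf 'Some like it hot and some like it cold\n' > text.txt

# Extract words, drop stop words, then count case-insensitively
grep -oE '[[:alpha:]]+' text.txt |
grep -vwFif stopwords |
sort -f |
uniq -ic |
sort -nr
```

With a word-frequency list like this, the asker's original problem (a huge blank-word count from spaces and punctuation) disappears, because grep -oE '[[:alpha:]]+' emits only runs of letters and never an empty line.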