I want to find, say, the 10 most common words in a text file. First, the solution should be optimized for keystrokes (in other words, my time). Second, for performance. Here is what I have so far to get the top 10:
cat test.txt | tr -c '[:alnum:]' '[\n*]' | uniq -c | sort -nr | head -10
6 k
2 g
2 e
2 a
1 r
1 k22
1 k
1 f
1 eeeeeeeeeeeeeeeeeeeee
1 d
I could write a Java, Python, etc. program where I store (word, numberOfOccurrences) in a dictionary and sort by the value, or I could use MapReduce, but I'm optimizing for keystrokes.
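For comparison, the dictionary idea also fits on the command line: awk's associative array can play the role of the (word, count) map. A sketch, with an illustrative sample line piped in instead of test.txt:

```shell
# Sample input piped in for illustration; use < test.txt in practice.
printf 'the quick fox and the lazy dog and the cat\n' |
  tr -c '[:alnum:]' '[\n*]' |   # one word per line, as in the question
  awk 'NF { count[$0]++ }       # count[word]++ is the dictionary step
       END { for (w in count) print count[w], w }' |
  sort -nr | head -10
```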
Are there any false positives? Is there a better way?
Best Answer
That's pretty much the most common way of finding "N most common things", except you're missing a sort, and you've got a gratuitous cat:

tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head -10

If you don't put a sort before the uniq -c, you'll probably get a lot of false singleton words. uniq only collapses consecutive runs of identical lines, not duplicates across the whole input.

EDIT: I forgot a trick, "stop words". If you're looking at English text (sorry, monolingual North American here), words like "of", "and", "the" almost always take the top two or three places. You probably want to eliminate them. The GNU Groff distribution has a file named eign in it which contains a pretty decent list of stop words. My Arch distro has /usr/share/groff/current/eign, but I think I've also seen /usr/share/dict/eign or /usr/dict/eign in old Unixes. You can use stop words like this:

tr -c '[:alnum:]' '[\n*]' < test.txt | fgrep -v -w -f /usr/share/groff/current/eign | sort | uniq -c | sort -nr | head -10
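If eign isn't installed, the same idea works with any newline-separated word list. A self-contained sketch (stopwords.txt and the sample sentence are purely illustrative):

```shell
# A tiny stand-in for groff's eign stop-word list.
printf 'the\nof\nand\n' > stopwords.txt

printf 'the cat and the dog of the house\n' |
  tr -c '[:alnum:]' '[\n*]' |      # one word per line
  fgrep -v -w -f stopwords.txt |   # drop whole-word stop-word matches
  sort | uniq -c | sort -nr | head -10
```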
My guess is that most human languages need similar "stop words" removed from meaningful word-frequency counts, but I don't know where to suggest getting stop-word lists for other languages.
EDIT: fgrep should use the -w option, which enables whole-word matching. This avoids false positives from words that merely contain short stop words, like "a" or "i".
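The effect of -w is easy to see with a one-letter stop word (a minimal demo; stop.txt is just for illustration):

```shell
printf 'a\n' > stop.txt   # stop-word list containing just "a"

# Without -w, "a" matches inside "cat" and "and", wrongly removing them:
printf 'cat\nand\na\n' | fgrep -v -f stop.txt      # prints nothing

# With -w, only the standalone word "a" is removed:
printf 'cat\nand\na\n' | fgrep -v -w -f stop.txt   # prints "cat" and "and"
```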