Shell – Strip most frequent words from text

shell-scripttext processing

I have a simple problem, but unfortunately I don't know where to even start (I'm just starting out). So, what I want to do is ultimately increase my vocabulary. I got the idea to strip the most commonly used words out of news articles. I found a list of the 5,000 most commonly used words and saved it. After I get the most commonly used words stripped out, I can create a corpus in TextSTAT and do a word frequency count and choose which words I want to learn that way. But how do I get the words in my most commonly used words list to be removed from the articles I'll be saving?

Best Answer

Assuming you've got files named "news.articles1", "news.articles2", etc, and you've got your commonly-used words in a file named "stop.words"

cat news.articles* | tr -s '[:blank:]' '[\n*]' |
tr '[:upper:]' '[:lower:]' | fgrep -v -f stop.words

The output of that pipeline ought to contain none of your commonly-used words. You may need to remove all punctuation with an additional step in the pipeline, like:

tr -d '[:punct:]'

A good English-language version of "stop.words" is often in /usr/share/groff/<version>/eign.

Related Solutions

Shell – Strip // Comments From Files

A quick google search returns a similar question at stackoverflow.

Command Line – Find the Most Frequent Words in a File

That's pretty much the most common way of finding "N most common things", except you're missing a sort, and you've got a gratuitious cat:

tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head  -10

If you don't put in a sort before the uniq -c you'll probably get a lot of false singleton words. uniq only does unique runs of lines, not overall uniquness.

EDIT: I forgot a trick, "stop words". If you're looking at English text (sorry, monolingual North American here), words like "of", "and", "the" almost always take the top two or three places. You probably want to eliminate them. The GNU Groff distribution has a file named eign in it which contains a pretty decent list of stop words. My Arch distro has /usr/share/groff/current/eign, but I think I've also seen /usr/share/dict/eign or /usr/dict/eign in old Unixes.

You can use stop words like this:

tr -c '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head  -10

My guess is that most human languages need similar "stop words" removed from meaningful word frequency counts, but I don't know where to suggest getting other languages stop words lists.

EDIT: fgrep should use the -w command, which enables whole-word matching. This avoids false positives on words that merely contain short stop works, like "a" or "i".

Best Answer

Related Solutions

Shell – Strip // Comments From Files

Command Line – Find the Most Frequent Words in a File

Related Question