Shell – Strip most frequent words from text

shell-script, text-processing

I have a simple problem, but unfortunately I don't know where to even start (I'm just starting out). What I ultimately want to do is increase my vocabulary. My idea is to strip the most commonly used words out of news articles; I found a list of the 5,000 most commonly used words and saved it. Once those words are stripped out, I can build a corpus in TextSTAT, run a word frequency count, and choose which words to learn from what remains. But how do I remove the words in my most-commonly-used-words list from the articles I'll be saving?

Best Answer

Assuming you've got files named "news.articles1", "news.articles2", etc., and you've got your commonly-used words in a file named "stop.words":

cat news.articles* | tr -s '[:blank:]' '[\n*]' |
tr '[:upper:]' '[:lower:]' | fgrep -v -x -f stop.words

The output of that pipeline ought to contain none of your commonly-used words: the two tr invocations turn the articles into one lowercased word per line, and fgrep -v -x -f removes every line that exactly matches an entry in stop.words (the -x flag keeps a short stop word like "a" from also knocking out longer words that merely contain it). You may need to remove all punctuation with an additional step in the pipeline, like:

tr -d '[:punct:]'
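
Putting it all together, a minimal sketch of the whole pipeline might look like the following (the output file name remaining.words is just an illustration, not something the tools require):

# Combine every article, drop punctuation, split the text into one
# lowercase word per line, then drop every word listed in stop.words.
cat news.articles* |
  tr -d '[:punct:]' |
  tr -s '[:blank:]' '[\n*]' |
  tr '[:upper:]' '[:lower:]' |
  fgrep -v -x -f stop.words > remaining.words

If you'd rather get the frequency count in the shell instead of importing into TextSTAT, appending | sort | uniq -c | sort -rn to the pipeline prints each remaining word with its count, most frequent first.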

A good English-language version of "stop.words" is often in /usr/share/groff/<version>/eign.
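
For example, assuming groff is installed, you could append that file to your own list; the glob stands in for whatever version directory your system happens to have:

# Append groff's list of common English words to your own stop-word list.
cat /usr/share/groff/*/eign >> stop.words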
