What’s the easiest way to make a list of the most common words in a set of text files?

text processing

Say I have a bunch of text files containing fiction, non-fiction, newspaper articles, etc. (random examples of text in a given language).

I want a frequency list of the words in those files, most common word first.

I could write some C code to do this, but if there's a faster way to do this, I'd like to know it. (When I say faster, I mean coding time, not run time.)

Best Answer

For the shortest coding time, this is what I use successfully right now:

cat *.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn | less
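
The one-liner above counts "The" and "the" as different words and keeps punctuation attached to them. A minimal sketch of a variant that folds case and strips punctuation first, assuming that is what you want from a frequency list (the function name word_freq is made up for illustration):

```shell
# word_freq: hypothetical helper name, not part of the original answer.
# Prints "count word" lines, most frequent word first.
word_freq() {
    cat -- "$@" |
        tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" count together
        tr -d '[:punct:]' |            # drop punctuation (this also removes apostrophes)
        tr -s '[:space:]' '\n' |       # one word per line
        sed '/^$/d' |                  # drop the empty line a leading blank can leave
        sort | uniq -c | sort -rn      # count duplicates, highest count first
}
```

It would be called as: word_freq *.txt | less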

Related Solutions

Shell – Find files that have words in common

First, generate an index of the unique words in mainFile:

tr -s '[:space:]' '\n' < mainFile | grep . | sort -u > mainFile.idx

Then do a grep for fixed strings:

grep -F -f mainFile.idx file*
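
Note that grep -F matches substrings and prints matching lines rather than file names. A rough sketch of the same two steps wrapped in a function that lists only the names of files sharing at least one whole word with the reference file; it assumes a grep that supports -w (GNU and BSD grep do), and the function name common_word_files is hypothetical:

```shell
# common_word_files: hypothetical helper name. First argument is the reference
# file; the remaining arguments are the files to search.
common_word_files() {
    main=$1
    shift
    idx=$(mktemp) || return 1
    # One unique word per line. sed drops empty lines, because an empty
    # pattern would make grep -F match every line of every file.
    tr -s '[:space:]' '\n' < "$main" | sed '/^$/d' | sort -u > "$idx"
    # -F fixed strings, -w whole words only, -l print only matching file names
    grep -Fwl -f "$idx" -- "$@"
    status=$?
    rm -f "$idx"
    return "$status"
}
```

It would be called as: common_word_files mainFile file*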

Shell – Strip most frequent words from text

Assuming you've got files named "news.articles1", "news.articles2", etc., and you've got your commonly-used words, one per line, in a file named "stop.words":

cat news.articles* | tr -s '[:blank:]' '[\n*]' |
tr '[:upper:]' '[:lower:]' | grep -Fvx -f stop.words

The output of that pipeline ought to contain none of your commonly-used words. You may need to remove all punctuation with an additional step in the pipeline, like:

tr -d '[:punct:]'

A good English-language version of "stop.words" is often in /usr/share/groff/<version>/eign.
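
As a sketch of where the punctuation step could go, here is the whole pipeline as one function. It assumes the stop-word file holds one lowercase word per line; the function name strip_stop_words is made up for illustration. The -x option matters: without it, a stop word such as "a" would also remove every word that merely contains an "a".

```shell
# strip_stop_words: hypothetical helper name. First argument is the stop-word
# file (assumed: one lowercase word per line); the rest are the text files.
strip_stop_words() {
    stop=$1
    shift
    cat -- "$@" |
        tr '[:upper:]' '[:lower:]' |   # match stop words regardless of case
        tr -d '[:punct:]' |            # strip punctuation before matching
        tr -s '[:space:]' '\n' |       # one word per line
        sed '/^$/d' |                  # drop empty lines
        grep -Fvx -f "$stop"           # -x: drop a word only if the whole line matches
}
```

It would be called as: strip_stop_words stop.words news.articles*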
