Count Word Occurrences in Text File – Using Grep and Cut

Tags: cut, grep, text-processing

I have a text file containing tweets, and I'm required to count the number of times a word is mentioned in the file. For example, the file contains:

Apple iPhone X is going to worth a fortune
The iPhone X is Apple's latest flagship iPhone. How will it pit against it's competitors?

And let's say I want to count how many times the word iPhone is mentioned in the file. So here's what I've tried.

cut -f 1 Tweet_Data | grep -i "iPhone" | wc -l

It certainly works, but I'm confused about the 'wc' command in Unix. What is the difference if I try something like:

cut -f 1 Tweet_Data | grep -c "iPhone"

where -c is used instead? Both of these yield different results on a large file full of tweets, and I'm confused about how this works. Which method is the correct way of counting the occurrences?

Best Answer

Given such a requirement, I would use GNU grep (for its -o option) and pipe the output through wc to count the total number of occurrences:

$ grep -o -i iphone Tweet_Data | wc -l
3

Plain grep -c on the data counts the number of lines that match, not the total number of matches. The -o option tells grep to print each match on its own line, no matter how many times the match was found in the original line.
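You can see the difference on a single line that contains the word twice (a small sketch using a sample string, not your actual Tweet_Data):

```shell
# One line containing "iPhone" twice
line="The iPhone X is Apple's latest flagship iPhone."

# -c counts matching *lines*: this one line matches, so the count is 1
printf '%s\n' "$line" | grep -c -i iphone

# -o emits one line per *match*, so wc -l counts every occurrence: 2
printf '%s\n' "$line" | grep -o -i iphone | wc -l
```

On a file where no line mentions the word more than once, the two pipelines happen to agree; the gap you see on a large tweet file comes entirely from lines with multiple mentions.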

wc -l tells the wc utility to count lines. Since grep has put each match on its own line, that line count is the total number of occurrences of the word in the input.


If GNU grep is not available (or desired), you could transform the input with tr so that each word is on its own line, then use grep -c to count:

$ tr '[:space:]' '[\n*]' < Tweet_Data | grep -i -c iphone
3
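One caveat that applies to both pipelines: grep matches substrings, so a word like "iPhones" would also be counted. If you want whole-word matches only, the -w option (supported by GNU and BSD grep, though not required by POSIX) restricts matches to complete words. A quick sketch with a made-up input string:

```shell
# Substring matching counts "iPhones" too: 3 matches
printf 'iPhone iPhones iphone\n' | grep -o -i iphone | wc -l

# -w requires word boundaries, so "iPhones" is excluded: 2 matches
printf 'iPhone iPhones iphone\n' | grep -o -i -w iphone | wc -l
```

Whether that matters depends on how strictly you define a "mention" of the word in a tweet.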