Text Processing – How to Count Lines Containing One of Two Words but Not Both

awkgreptext processing

I need to count the lines containing the words the and an in a text file (poem.txt), but not those containing both.

I've tried using

grep -c the poem.txt | grep -c an poem.txt

but this gives me the wrong answer of 6 when the total number of the and an is 9 lines.

I do want to count the lines containing the words and not the words themselves. Only the actual word should count, so the but not there, and an but not Pan.

Example file: poem.txt

Where is the misty shark?
Where is she?
The small reef roughly fights the mast.
Where is the small gull?
Where is he?
The gull grows like a clear pirate.
Clouds fall like old mainlands.

She will Rise calmly like a dead pirate.
Eat an orange.
Warm, sunny sharks quietly pull a cold, old breeze.
All ships command rough, rainy sails.

Elvis Aaron Presley also known simply as the Elvis
He is also referred to as the King
The best-selling solo music artist of all time
He was the most commercially successful artist in many genres

He has many awards including a Grammy lifetime achievement
Elvis in the 1970s has numerous jumpsuits including an eagle one.

To clarify further: how many lines in the poem contain the or an but you should not count the lines that include both the and an.

the car is red - this counted
an apple is in the corner - not counted
hello i am big - not counted
where is an apple - counted

So here the output should be 2.

Edit: I'm not worried about case sensitivity.

screenshot of highlighted words

Final edit: Thanks for all your help.
i've managed to solve the problem. i used the one of the answer and changed it a little. i used
cat poem.txt | grep -Evi -e '\<an .* the\>' -e '\<the .* an\>' | grep -Eci -e '\<(an|the)\> how ever i did change the -c in the second grep to a -n for some added info.
Yet again thank you for all the help!! 🙂

Best Answer

With grep:

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eci -e '\<(an|the)\>'

This counts the matched lines. You can find an alternative syntax which counts the total number of matches down below.

Breakdown:

The frist grep command filters out all lines containing both 'an' and 'the'. The second grep command counts those lines, containing either 'an' or 'the'.

If you remove the c from the second grep's -Eci, you will see all matches highlighted.

Details:

  • The -E option enables extended expression syntax (ERE) for grep.

  • The -i option tells grep to match case-insensitive

  • The -v option tells grep to invert the result (i.e. match lines not containing the pattern)

  • The -c option tells grep to output the number of matched lines instead of the lines themselves

  • The patterns:

    1. \< matches the beginning of a word (thanks @glenn-jackman)
    2. \> matches the end of a word (thanks @glenn-jackman)

    --> That way we can make sure to not match words containing 'the' or 'an' (like 'pan')

    1. grep -Evi -e '\<an\>.*\<the\>' thus matches all lines not containing 'an ... the'

    2. Similarly, grep -Evi -e '\<the\>.*\<an\>' matches all lines not containing 'the ... an'

    3. grep -Evi -e '\<an\>.*\<the\>' -e '\<the.*an\>' is the combination of the 3. and 4.

    4. grep -Eci -e '\<(an|the)\>' matches all lines containing either 'an' or 'the' (surrounded by whitespace or start/end of line) and prints the number of matched lines

EDIT 1: Use \< and \> instead of ( |^) and ( |$), as suggested by @glenn-jackman

EDIT 2: In order to count the number of matches instead of the number of matched lines, use the following expression:

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eio -e '\<(an|the)\>' \
  | wc -l

This uses the -o option of grep, which prints every match in a separate line (and nothing else) and then wc -l to count the lines.

Related Question