I need to count the lines containing the words the
and an
in a text file (poem.txt
), but not those containing both.
I've tried using
grep -c the poem.txt | grep -c an poem.txt
but this gives me the wrong answer of 6 when the total number of the
and an
is 9 lines.
I do want to count the lines containing the words and not the words themselves. Only the actual word should count, so the
but not there
, and an
but not Pan
.
Example file: poem.txt
Where is the misty shark?
Where is she?
The small reef roughly fights the mast.
Where is the small gull?
Where is he?
The gull grows like a clear pirate.
Clouds fall like old mainlands.
She will Rise calmly like a dead pirate.
Eat an orange.
Warm, sunny sharks quietly pull a cold, old breeze.
All ships command rough, rainy sails.
Elvis Aaron Presley also known simply as the Elvis
He is also referred to as the King
The best-selling solo music artist of all time
He was the most commercially successful artist in many genres
He has many awards including a Grammy lifetime achievement
Elvis in the 1970s has numerous jumpsuits including an eagle one.
To clarify further: how many lines in the poem contain the
or an
but you should not count the lines that include both the
and an
.
the car is red - this counted
an apple is in the corner - not counted
hello i am big - not counted
where is an apple - counted
So here the output should be 2.
Edit: I'm not worried about case sensitivity.
Final edit: Thanks for all your help.
i've managed to solve the problem. i used the one of the answer and changed it a little. i used
cat poem.txt | grep -Evi -e '\<an .* the\>' -e '\<the .* an\>' | grep -Eci -e '\<(an|the)\>
how ever i did change the -c
in the second grep to a -n
for some added info.
Yet again thank you for all the help!! 🙂
Best Answer
With grep:
This counts the matched lines. You can find an alternative syntax which counts the total number of matches down below.
Breakdown:
The frist grep command filters out all lines containing both 'an' and 'the'. The second grep command counts those lines, containing either 'an' or 'the'.
If you remove the
c
from the second grep's-Eci
, you will see all matches highlighted.Details:
The
-E
option enables extended expression syntax (ERE) for grep.The
-i
option tells grep to match case-insensitiveThe
-v
option tells grep to invert the result (i.e. match lines not containing the pattern)The
-c
option tells grep to output the number of matched lines instead of the lines themselvesThe patterns:
\<
matches the beginning of a word (thanks @glenn-jackman)\>
matches the end of a word (thanks @glenn-jackman)--> That way we can make sure to not match words containing 'the' or 'an' (like 'pan')
grep -Evi -e '\<an\>.*\<the\>'
thus matches all lines not containing 'an ... the'Similarly,
grep -Evi -e '\<the\>.*\<an\>'
matches all lines not containing 'the ... an'grep -Evi -e '\<an\>.*\<the\>' -e '\<the.*an\>'
is the combination of the 3. and 4.grep -Eci -e '\<(an|the)\>'
matches all lines containing either 'an' or 'the' (surrounded by whitespace or start/end of line) and prints the number of matched linesEDIT 1: Use
\<
and\>
instead of( |^)
and( |$)
, as suggested by @glenn-jackmanEDIT 2: In order to count the number of matches instead of the number of matched lines, use the following expression:
This uses the
-o
option of grep, which prints every match in a separate line (and nothing else) and thenwc -l
to count the lines.