Text Processing – How to Grep -v and Exclude the Next Line After Match

Tags: grep, text processing

How do I filter out two lines for each line matching the grep regex, i.e. the matching line and the line after it?
This is my minimal test:

SomeTestAAAA
EndTest
SomeTestABCD
EndTest
SomeTestDEFG
EndTest
SomeTestAABC
EndTest
SomeTestACDF
EndTest

I tried, for example, grep -vA 1 SomeTestAA, which doesn't work.

Desired output:

SomeTestABCD
EndTest
SomeTestDEFG
EndTest
SomeTestACDF
EndTest

Best Answer

You can use grep with -P (PCRE):

grep -P -A 1 'SomeTest(?!AA)' file.txt

(?!AA) is a zero-width negative lookahead: it ensures that SomeTest is not immediately followed by AA. -A 1 then prints each matching line together with the line after it.

Test:

$ grep -P -A 1 'SomeTest(?!AA)' file.txt 
SomeTestABCD
EndTest
SomeTestDEFG
EndTest
SomeTestACDF
EndTest
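
The lookahead approach selects the lines to keep. An alternative is to delete each matching line together with the line after it, which works without PCRE support. The sketch below is not part of the original answer; the function names are made up for illustration, and the sed version assumes the pattern contains no / character. One further caveat: GNU grep normally prints a -- separator between non-adjacent groups of context lines, so real output from the -A 1 command may contain such lines unless --no-group-separator (a GNU grep option) is added.

```shell
# Illustrative helpers (names are hypothetical); both read stdin.

# sed: on a matching line, append the next line (N) and delete both (d).
# $!N skips N on the last line, so a match on the final line is still deleted.
drop_match_and_next_sed() {
  sed -e "/$1/{" -e '$!N' -e 'd' -e '}'
}

# awk: on a matching line, read the following line and skip both.
drop_match_and_next_awk() {
  awk -v re="$1" '$0 ~ re { getline; next } 1'
}

# Example with the sample data from the question:
printf '%s\n' SomeTestAAAA EndTest SomeTestABCD EndTest SomeTestDEFG EndTest \
              SomeTestAABC EndTest SomeTestACDF EndTest |
  drop_match_and_next_sed 'SomeTestAA'
```

With the question's input, both helpers should print the six lines listed under the desired output.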

Related Solutions

Text Processing with Sed and Grep – Return Only the Portion of a Line After a Matching Pattern

The canonical tool for that would be sed.

sed -n -e 's/^.*stalled: //p'

Detailed explanation:

-n means not to print anything by default.
-e is followed by a sed command.
s is the pattern replacement command.
The regular expression ^.*stalled: matches the pattern you're looking for, plus any preceding text (.* meaning any text, with an initial ^ to say that the match begins at the beginning of the line). Note that if stalled: occurs several times on the line, this will match the last occurrence.
The match, i.e. everything on the line up to stalled:, is replaced by the empty string (i.e. deleted).
The final p means to print the transformed line.
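
To make the explanation concrete, here is a small sketch. The sample lines and the function name are invented for illustration; only the sed command itself comes from the answer.

```shell
# Hypothetical wrapper around the command above; reads stdin.
after_stalled() {
  sed -n -e 's/^.*stalled: //p'
}

# Made-up sample input.
printf '%s\n' \
  'job 17 stalled: waiting for lock' \
  'job 18 finished' \
  'a stalled: b stalled: c' |
  after_stalled
# Line 1 yields "waiting for lock".
# Line 2 contains no "stalled: ", so nothing is printed for it.
# Line 3 yields "c", because the greedy .* extends to the last "stalled: ".
```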

If you want to retain the matching portion, use a backreference: \1 in the replacement part designates what is inside a group \(…\) in the pattern. Here, you could write stalled: again in the replacement part; this feature is useful when the pattern you're looking for is more general than a simple string.

sed -n -e 's/^.*\(stalled: \)/\1/p'

Sometimes you'll want to remove the portion of the line after the match. You can include it in the match by including .*$ at the end of the pattern (any text .* followed by the end of the line $). Unless you put that part in a group that you reference in the replacement text, the end of the line will not be in the output.
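
The answer gives no command for this case, so the following is only a sketch of what it might look like. The function names and the sample line are assumptions made for illustration.

```shell
# Hypothetical helpers; both read stdin.

# Keep everything up to and including the marker, drop what follows it.
upto_stalled() {
  sed -n -e 's/\(stalled: \).*$/\1/p'
}

# Keep only the marker: the text before and after it is matched
# but not referenced in the replacement, so it disappears.
only_stalled() {
  sed -n -e 's/^.*\(stalled: \).*$/\1/p'
}

echo 'job 17 stalled: waiting for lock' | upto_stalled
# prints "job 17 stalled: " (note the trailing space, which is part of the group)
```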

As a further illustration of groups and backreferences, this command swaps the part before the match and the part after the match.

sed -n -e 's/^\(.*\)\(stalled: \)\(.*\)$/\3\2\1/p'
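
A quick way to check which group captures what is to run the swap on a sample line. The wrapper name and the input are made up for illustration.

```shell
# Hypothetical wrapper around the swap command above; reads stdin.
swap_around_stalled() {
  sed -n -e 's/^\(.*\)\(stalled: \)\(.*\)$/\3\2\1/p'
}

echo 'job 17 stalled: waiting for lock' | swap_around_stalled
# \1 is "job 17 ", \2 is "stalled: ", \3 is "waiting for lock",
# so the output is "waiting for lockstalled: job 17 " (with a trailing space).
```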

Grep – Inverse Match and Exclude Lines Before and After

don's answer might be better in most cases, but just in case the file is really big, and you can't get sed to handle a script file that large (which can happen at around 5000+ lines of script), here it is with plain sed:

sed -ne:t -e"/\n.*$match/D" \
    -e'$!N;//D;/'"$match/{" \
            -e"s/\n/&/$A;t" \
            -e'$q;bt' -e\}  \
    -e's/\n/&/'"$B;tP"      \
    -e'$!bt' -e:P  -e'P;D'

This is an example of what is called a sliding window on input. It works by building a look-ahead buffer of $B-count lines before ever attempting to print anything.

And actually, probably I should clarify my previous point: the primary performance limiter for both this solution and don's will be directly related to interval. This solution will slow with larger interval sizes, whereas don's will slow with larger interval frequencies. In other words, even if the input file is very large, if the actual interval occurrence is still very infrequent then his solution is probably the way to go. However, if the interval size is relatively manageable, and is likely to occur often, then this is the solution you should choose.

So here's the workflow:

If $match is found in pattern space preceded by a \newline, sed will recursively Delete every \newline that precedes it.
- I was clearing $match's pattern space out completely before - but to easily handle overlap, leaving a landmark seems to work far better.
- I also tried s/.*\n.*\($match\)/\1/ to try to get it in one go and dodge the loop, but when $A/$B are large, the Delete loop proves considerably faster.
Then we pull in the Next line of input preceded by a \newline delimiter and try once again to Delete a /\n.*$match/ by referring to our most recently used regular expression w/ //.
If pattern space matches $match then it can only do so with $match at the head of the line - all $Before lines have been cleared.
- So we start looping over $After.
- Each run of this loop we'll attempt to s///ubstitute for &itself the $Ath \newline character in pattern space, and, if successful, test will branch us - and our whole $After buffer - out of the script entirely to start the script over from the top with the next input line if any.
- If the test is not successful we'll branch back to the :top label and recurse for another line of input - possibly starting the loop over if $match occurs while gathering $After.
If we get past a $match function loop, then we'll try to print the $last line if this is it, and if !not try to s///ubstitute for &itself the $Bth \newline character in pattern space.
- We'll test this, too, and if it is successful we'll branch to the :Print label.
- If not we'll branch back to :top and get another input line appended to the buffer.
If we make it to :Print we'll Print then Delete up to the first \newline in pattern space and rerun the script from the top with what remains.

And so this time, if we were doing A=2 B=2 match=5; seq 5 | sed...

The pattern space for the first iteration at :Print would look like:

^1\n2\n3$

And that's how sed gathers its $Before buffer. And so sed prints to output $B-count lines behind the input it has gathered. This means that, given our previous example, sed would Print 1 to output, and then Delete that and send back to the top of the script a pattern space which looks like:

^2\n3$

...and at the top of the script the Next input line is retrieved and so the next iteration looks like:

^2\n3\n4$

And so when we find the first occurrence of 5 in input, the pattern space actually looks like:

^3\n4\n5$

Then the Delete loop kicks in and when it's through it looks like:

^5$

And when the Next input line is pulled sed hits EOF and quits. By that time it has only ever Printed lines 1 and 2.
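
For readers who find the sed script hard to follow, the same sliding-window idea can be sketched in awk. This is not the answer's method, just a stand-in that assumes the same semantics: hold back $B lines before printing, drop the held lines when a match arrives, then skip the $A lines after it. The function name and argument order are invented for illustration.

```shell
# Hypothetical helper: exclude_context AFTER BEFORE REGEX  (reads stdin)
exclude_context() {
  awk -v A="$1" -v B="$2" -v re="$3" '
    # A match discards the buffered "before" lines and (re)starts the
    # "after" countdown; this rule comes first so that a match inside
    # an "after" window restarts the countdown.
    $0 ~ re  { n = 0; skip = A; next }
    # Still inside the "after" window of an earlier match.
    skip > 0 { skip--; next }
    {
      # Buffer the line; once more than B lines are held, the oldest
      # one can no longer fall in any "before" window, so print it.
      buf[n++] = $0
      if (n > B) {
        print buf[0]
        for (i = 1; i < n; i++) buf[i - 1] = buf[i]
        n--
      }
    }
    # At end of input nothing can match any more: flush the buffer.
    END { for (i = 0; i < n; i++) print buf[i] }
  '
}

# The walk-through above: A=2 B=2 match=5 on the numbers 1 to 5.
seq 5 | exclude_context 2 2 5
```

For the walk-through input this prints 1 and 2, in line with the trace above. Shifting the buffer costs time proportional to $B per line, which is acceptable for a sketch but would be worth replacing with a ring buffer for large windows.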

Here's an example run:

A=8 B=7 match='[24689]0'
seq 100 |
sed -ne:t -e"/\n.*$match/D" \
    -e'$!N;//D;/'"$match/{" \
            -e"s/\n/&/$A;t" \
            -e'$q;bt' -e\}  \
    -e's/\n/&/'"$B;tP"      \
    -e'$!bt' -e:P  -e'P;D'

That prints:
