Consider a text file with the following entries:
aaa
bbb
ccc
ddd
eee
fff
ggg
hhh
iii
Given a pattern (e.g. fff), I would like to grep the file above to get in the output:
all_lines except (pattern_matching_lines U (B lines_before) U (A lines_after))
For example, if B = 2 and A = 1, the output with pattern = fff should be:
aaa
bbb
ccc
hhh
iii
How can I do this with grep or other command line tools?
Note, when I try:
grep -v 'fff' -A1 -B2 file.txt
I don't get what I want. I instead get:
aaa
bbb
ccc
ddd
eee
fff
--
--
fff
ggg
hhh
iii
Best Answer
don's answer might be better in most cases, but just in case the file is really big and you can't get sed to handle a script file that large (which can happen at around 5000+ lines of script), here it is with plain sed:
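A sketch of such a script, assembled from the steps listed in the workflow further down, could look like the following. It is not necessarily the exact original. The wrapper name sed_exclude and its argument order are assumptions made for illustration, and the sketch assumes B >= 1, A >= 1, and a basic-regex pattern that contains no "/".

```shell
# Hypothetical wrapper (name and argument order are assumptions).
# usage: sed_exclude MATCH B A [FILE...]
# Assumes B >= 1 and A >= 1; MATCH is a basic regex without "/".
sed_exclude() {
    match=$1 B=$2 A=$3
    shift 3
    sed -n -e ':t' \
        -e "/\n.*$match/D" \
        -e '$!N' \
        -e '//D' \
        -e "/$match/{" \
        -e "s/\n/&/$A;t" \
        -e '$q;bt' \
        -e '}' \
        -e "s/\n/&/$B;tP" \
        -e '$!bt' \
        -e ':P' \
        -e 'P;D' "$@"
}
# :t           top of the loop
# /\n.*m/D     drop buffered "before" lines that sit in front of a match
# $!N; //D     append the next input line, then retry the same deletion
# /m/{...}     match at the head: collect A "after" lines, then discard all
# s/\n/&/B;tP  with B newlines buffered, the oldest line is safe to print
# P;D          print the oldest line, delete it, rerun with the remainder
```

With the question's file, sed_exclude fff 2 1 file.txt should leave aaa, bbb, ccc, hhh and iii.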
This is an example of what is called a sliding window on input. It works by building a look-ahead buffer of $B-count lines before ever attempting to print anything.

And actually, I should probably clarify my previous point: the primary performance limiter for both this solution and don's will be directly related to the interval. This solution will slow with larger interval sizes, whereas don's will slow with larger interval frequencies. In other words, even if the input file is very large, if the interval still occurs only infrequently, then his solution is probably the way to go. However, if the interval size is relatively manageable and is likely to occur often, then this is the solution you should choose.
So here's the workflow:
- If $match is found in pattern space preceded by a \n newline, sed will recursively Delete every \n newline that precedes it.
  - I was clearing $match's pattern space out completely before, but to easily handle overlap, leaving a landmark seems to work far better.
  - I also tried s/.*\n.*\($match\)/\1/ to try to get it in one go and dodge the loop, but when $A/$B are large, the Delete loop proves considerably faster.
- Then we pull in the Next line of input preceded by a \n newline delimiter and try once again to Delete a /\n.*$match/ by referring to our most recently used regular expression with //.
- If pattern space matches $match then it can only do so with $match at the head of the line: all $B before lines have been cleared.
  - So we start looping over the $A after lines.
  - Each run of this loop we'll attempt to s/// substitute for & itself the $A-th \n newline character in pattern space, and, if successful, test will branch us, and our whole $A after buffer, out of the script entirely to start the script over from the top with the next input line, if any.
  - If the test is not successful we'll branch back to the :top label and recurse for another line of input, possibly starting the loop over if $match occurs while gathering the $A after lines.
- If we get past a $match function loop, then we'll try to print the $ last line if this is it, and if ! not, try to s/// substitute for & itself the $B-th \n newline character in pattern space.
  - We'll test this, too, and if it is successful we'll branch to the :Print label.
  - If not we'll branch back to :top and get another input line appended to the buffer.
- If we make it to :Print we'll Print then Delete up to the first \n newline in pattern space and rerun the script from the top with what remains.

And so this time, if we were doing
A=2 B=2 match=5; seq 5 | sed...
The pattern space for the first iteration at :Print would look like: 1\n2\n3

And that's how sed gathers its $B before buffer. And so sed prints to output $B-count lines behind the input it has gathered. This means that, given our previous example, sed would Print 1 to output, and then Delete that and send back to the top of the script a pattern space which looks like: 2\n3

...and at the top of the script the Next input line is retrieved, and so the next iteration looks like: 2\n3\n4

And so when we find the first occurrence of 5 in input, the pattern space actually looks like: 3\n4\n5

Then the Delete loop kicks in, and when it's through it looks like: 5

And when the Next input line is pulled, sed hits EOF and quits. By that time it has only ever Printed lines 1 and 2.

Here's an example run:
That prints:
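The same B-behind, A-ahead window can also be cross-checked with a short awk sketch. This is not part of the sed solution; it swaps sed's pattern-space buffer for an explicit queue of held-back lines. The helper name awk_exclude and its argument order are assumptions made for illustration, and the pattern is treated as an awk extended regular expression.

```shell
# Hypothetical helper (name and argument order are assumptions).
# usage: awk_exclude PATTERN B A [FILE...]
awk_exclude() {
    pat=$1 B=$2 A=$3
    shift 3
    awk -v pat="$pat" -v B="$B" -v A="$A" '
        BEGIN { head = 0; tail = 0; skip = 0 }
        # A match discards the buffered "before" lines and arms the "after" skip.
        $0 ~ pat { head = tail; skip = A; next }
        # Still inside the "after" window of an earlier match: drop the line.
        skip > 0 { skip--; next }
        {
            buf[tail++] = $0
            # Once more than B lines are held, the oldest is safe to print.
            if (tail - head > B) { print buf[head]; delete buf[head++] }
        }
        # Whatever remains at EOF was never within B lines of a match.
        END { while (head < tail) print buf[head++] }
    ' "$@"
}

# Example: the question's data with B=2 and A=1
printf '%s\n' aaa bbb ccc ddd eee fff ggg hhh iii | awk_exclude fff 2 1
```

Unlike the sed sketch, this version also accepts B=0 or A=0, because it counts lines instead of relying on the s/// occurrence number.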