Text Processing – Remove Lines Based on Pattern While Keeping First N Matches

awk, sed, text processing

I need to remove lines from a text file based on a pattern, but I need to keep the first n lines that match the pattern.

Input

% 1
% 2
% 3
% 4
% 5
text1
text2
text3

Output

% 1
% 2
text1
text2
text3

I tried sed '/^%/d' file, but it deletes all the lines starting with %. sed '3,/^%/d' file doesn't work either. I need to keep the first n lines that match the pattern and delete the rest.

Best Answer

If you want to delete all lines starting with % but preserve the first two lines of input, you could do:

sed -e 1,2b -e '/^%/d'

Though the same would be more legible with awk:

awk 'NR <= 2 || !/^%/'

Or, if you're after performance:

{ head -n 2; grep -v '^%'; } < input-file

If you want to preserve the first two lines matching the pattern, which may not be the first lines of the input, awk is certainly the better option:

awk '!/^%/ || ++n <= 2'
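The same idea can be parameterized so that both n and the pattern come from the command line. This is only a sketch: the function name keep_first_matches and the awk variable names max and pat are made up for illustration.

```shell
# Hypothetical helper: keep at most $1 lines matching the awk regex $2,
# and pass every non-matching line through unchanged.
keep_first_matches() {
  # "$0 !~ pat" short-circuits for non-matching lines, so n is only
  # incremented when the current line matches the pattern.
  awk -v max="$1" -v pat="$2" '$0 !~ pat || ++n <= max'
}

# Matches do not have to be at the top of the input.
printf '%s\n' 'text0' '% 1' '% 2' '% 3' 'text1' '% 4' 'text2' |
  keep_first_matches 2 '^%'
```

Here the lines % 3 and % 4 are dropped, while text0, % 1, % 2, text1 and text2 are kept in their original order.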

With sed, you could use tricks like:

sed -e '/^%/!b' -e 'x;/xx/{h;d;}' -e 's/^/x/;x'

That is, use the hold space to count the number of occurrences of the pattern matched so far. Not terribly efficient or legible.
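To make the hold-space counting easier to follow, here is the same script spread over several lines with comments. The wrapper function name keep_first_two_sed is an assumption for illustration; the sed commands themselves are unchanged.

```shell
# Hypothetical wrapper around the sed one-liner above, one command per line.
keep_first_two_sed() {
  sed '
    # Lines that do not start with % branch to the end and are printed as-is.
    /^%/!b
    # Swap: the pattern space now holds the counter, the hold space the line.
    x
    # Two marks already recorded: put the counter back in hold, drop the line.
    /xx/{
      h
      d
    }
    # Otherwise add one mark to the counter, then swap the line back to print it.
    s/^/x/
    x
  '
}

printf '%s\n' '% 1' 'text1' '% 2' '% 3' 'text2' | keep_first_two_sed
```

The counter is just a string of x characters, one per matching line kept so far, so keeping the first three matches would mean testing for /xxx/ instead of /xx/.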

Text Processing – Remove Lines from File Up to a Pattern

Explanation

-pe : read the input file line by line, apply the script given by -e to each line and print.
-0777 : slurp the entire file into memory.
s/.*?PAT[^\n]*\n?//s : remove everything until the 1st occurrence of PAT and until the end of the line.

For larger files, I don't see any way to avoid reading the file twice. Something like:

awk -vpat=5 '{
              if(NR==FNR){
                if($0~pat && !a){a++; next} 
                if(a){print}
              }
              else{ 
                if(!a){print}
                else{exit} 
              }
             }' file1 file1

Explanation

awk -vpat=5 : run awk and set the variable pat to 5.
if(NR==FNR){} : if this is the 1st file.
if($0~pat && !a){a++; next} : if this line matches the value of pat and a is not defined, increment a by one and skip to the next line.
if(a){print} : if a is defined (that is, a matching line has already been seen in this file), print the line.
else{ } : if this is not the 1st file (so it's the second pass).
if(!a){print} : if a is not defined, the pattern was never found, so we want the whole file; print every line.
else{exit} : if a is defined, we've already printed in the 1st pass so there's no need to reprocess the file.
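The two-pass logic walked through above can be wrapped in a small function to try both cases (pattern found, pattern not found). This is a sketch only: the function name print_after_first_match and the variable name found are invented here, and the file must be a regular file because it is read twice.

```shell
# Hypothetical wrapper: $1 is an awk regex, $2 is a regular file.
print_after_first_match() {
  awk -v pat="$1" '
    NR == FNR {                                  # first pass over the file
      if ($0 ~ pat && !found) { found = 1; next }  # skip the first matching line
      if (found) print                           # print everything after it
      next
    }
    !found { print; next }                       # second pass, no match: print all
    { exit }                                     # second pass, already printed: stop
  ' "$2" "$2"
}

tmp=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 > "$tmp"
print_after_first_match 5 "$tmp"   # prints 6, 7 and 8
print_after_first_match 9 "$tmp"   # no match, so all eight lines are printed
rm -f "$tmp"
```

When the pattern is found, the second pass exits on its first line, so the extra cost of listing the file twice is small in that case.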

Keep only the first line from every sequence of consecutive lines matching a pattern

Using awk:

awk '/logical IO/ {if (!seen) {print; seen=1}; next}; {print; seen=0}' file.txt

/logical IO/ {if (!seen) {print; seen=1}; next} checks whether the line contains logical IO. If it does and seen is false (that is, the previous line did not contain logical IO), the line is printed and seen is set to 1. In either case, next then skips to the following input line, so a matching line that directly follows another matching line is dropped.
For any other line, {print; seen=0} prints the line and resets seen to 0.

Example:

$ cat file.txt 
select * from test1 where 1=1
testing logical IO 24
select * from test2 where condition=4
parsing logical IO 45
testing logical IO 500
select * from test5 where 1=1
testing logical IO 24
select * from test5 where condition=78
parsing logical IO 346
parsing logical IO 346
testing logical IO 12

$ awk '/logical IO/ {if (!seen) {print; seen=1}; next}; {print; seen=0}' file.txt 
select * from test1 where 1=1
testing logical IO 24
select * from test2 where condition=4
parsing logical IO 45
select * from test5 where 1=1
testing logical IO 24
select * from test5 where condition=78
parsing logical IO 346
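If the pattern needs to vary, the same run-collapsing logic can be written with the pattern passed in as a variable. This is a sketch under that assumption; the function name first_of_each_run and the variables m and prev are illustrative only.

```shell
# Hypothetical helper: print non-matching lines, plus only the first line
# of each run of consecutive lines matching the awk regex $1.
first_of_each_run() {
  awk -v pat="$1" '
    { m = ($0 ~ pat) }        # does the current line match?
    !(m && prev) { print }    # drop it only if the previous line matched too
    { prev = m }              # remember the result for the next line
  '
}

printf '%s\n' 'query A' 'logical IO 1' 'query B' 'logical IO 2' 'logical IO 3' 'query C' |
  first_of_each_run 'logical IO'
```

In this sample only logical IO 3 is dropped, since it is the only matching line whose predecessor also matches.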
