Text Processing – Remove Lines from File Up to a Pattern

awk, filter, sed, text processing

I am attempting to write a filter using something like sed or awk to do the following:

  • If a given pattern does not exist in the input, copy the entire input to the output
  • If the pattern exists in the input, copy only the lines after the first occurrence to the output

This happens to be for a "git clean" filter, but that's probably not important. The important aspect is this needs to be implemented as a filter, because the input is provided on stdin.

I know how to use sed to delete lines up to a pattern, e.g. 1,/pattern/d, but that deletes the entire input if /pattern/ is not matched anywhere.
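To make the problem concrete, here is a small sketch of that behaviour. The helper name delete_through_match and the sample input are made up for the demonstration, and the pattern is pasted straight into the sed script, so it assumes the pattern contains no slashes.

```shell
# Hypothetical helper wrapping the sed command above; $1 is the pattern.
delete_through_match() {
  sed "1,/$1/d"
}

# Pattern present: lines 1 through the first match are deleted, the rest is kept.
printf '1\n2\n5\n11\n' | delete_through_match 5    # prints: 11

# Pattern absent: the range never ends, so every line is deleted.
printf 'foo\nbar\n' | delete_through_match 5       # prints nothing
```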

I can imagine writing a whole shell script that creates a temporary file, does a grep -q or something, and then decides how to process the input. I'd prefer to do this without messing around creating a temporary file, if possible. This needs to be efficient because git might call it frequently.

Best Answer

If your files are not too large to fit in memory, you could use perl to slurp the file:

perl -0777pe 's/.*?PAT[^\n]*\n?//s' file

Just change PAT to whatever pattern you're after. For example, given these two input files, using the pattern 5 on the first and 10 (which does not occur) on the second:

$ cat file
1
2
3
4
5
11
12
13
14
15
$ cat file1 
foo
bar
$ perl -0777pe 's/.*?5[^\n]*\n?//s' file
11
12
13
14
15
$ perl -0777pe 's/.*?10[^\n]*\n?//s' file1
foo
bar

Explanation

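  • -pe : read the input one record at a time, apply the script given by -e to each record and print the result.
  • -0777 : slurp mode; the whole file is read into memory as a single record, so the script runs once on the entire contents.
  • s/.*?PAT[^\n]*\n?//s : remove everything up to the first occurrence of PAT, plus the rest of that line and its newline. The non-greedy .*? stops at the first match, and the s modifier lets . match newlines. If PAT never matches, the substitution does nothing and the input is printed unchanged.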

For larger files, I don't see any way to avoid reading the file twice, which means the input has to be a regular file rather than a pipe. Something like:

awk -vpat=5 '{
              if(NR==FNR){
                if($0~pat && !a){a++; next} 
                if(a){print}
              }
              else{ 
                if(!a){print}
                else{exit} 
              }
             }' file1 file1

Explanation

  • awk -vpat=5 : run awk and set the variable pat to 5.
  • if(NR==FNR){} : if this is the 1st file.
  • if($0~pat && !a){a++; next} : if this line matches the value of pat and a is not defined, increment a by one and skip to the next line.
  • if(a){print} : if a is defined (if this file matches the pattern), print the line.
  • else{ } : if this is not the 1st file (so it's the second pass).
  • if(!a){print} : if a is not defined, we want the whole file, so print every line.
  • else{exit} : if a is defined, we've already printed in the 1st pass so there's no need to reprocess the file.
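If the input has to come from stdin, one possible alternative (not the two-pass method above) is to trade the second pass for memory: hold lines in an awk array until the pattern shows up, and only print that buffer at the end if it never did. This is a sketch under that assumption; the function name after_first_match and the variable names are made up here.

```shell
# Hypothetical single-pass filter: reads stdin, writes stdout; $1 is an awk regex.
after_first_match() {
  awk -v pat="$1" '
    found    { print; next }         # after the first match: pass lines straight through
    $0 ~ pat { found = 1; next }     # first match: drop this line and stop buffering
             { buf[++n] = $0 }       # before any match: hold the line in memory
    END      { if (!found) for (i = 1; i <= n; i++) print buf[i] }  # no match: emit everything
  '
}

printf '1\n2\n5\n11\n15\n' | after_first_match 5    # prints 11 and 15
printf 'foo\nbar\n' | after_first_match 5           # prints foo and bar
```

Memory use is bounded by the lines before the first match, or by the whole input when nothing matches, so for input that never matches it is no cheaper than slurping with perl.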