Text Processing – Remove Lines from File Up to a Pattern

awk, filter, sed, text processing

I am attempting to write a filter using something like sed or awk to do the following:

If a given pattern does not exist in the input, copy the entire input to the output
If the pattern exists in the input, copy only the lines after the first occurrence to the output

This happens to be for a "git clean" filter, but that's probably not important. The important aspect is this needs to be implemented as a filter, because the input is provided on stdin.

I know how to use sed to delete lines up to a pattern, e.g. 1,/pattern/d, but that deletes the entire input if /pattern/ is not matched anywhere.

I can imagine writing a whole shell script that creates a temporary file, does a grep -q or something, and then decides how to process the input. I'd prefer to do this without messing around creating a temporary file, if possible. This needs to be efficient because git might call it frequently.

Best Answer

If your files are not too large to fit in memory, you could use perl to slurp the file:

perl -0777pe 's/.*?PAT[^\n]*\n?//s' file

Just change PAT to whatever pattern you're after. For example, given these two input files and the pattern 5:

$ cat file
1
2
3
4
5
11
12
13
14
15
$ cat file1 
foo
bar
$ perl -0777pe 's/.*?5[^\n]*\n?//s' file
11
12
13
14
15
$ perl -0777pe 's/.*?10[^\n]*\n?//s' file1
foo
bar

Explanation

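-pe : for each input record (a line by default), apply the script given by -e and print the result.
-0777 : slurp mode; the entire file is read into memory as a single record, so the substitution runs once over the whole input.
s/.*?PAT[^\n]*\n?//s : remove everything up to the 1st occurrence of PAT and on to the end of that line. The trailing s lets . match newlines.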

For larger files, I don't see any way to avoid reading the file twice. Something like:

awk -vpat=5 '{
              if(NR==FNR){
                if($0~pat && !a){a++; next} 
                if(a){print}
              }
              else{ 
                if(!a){print}
                else{exit} 
              }
             }' file1 file1

Explanation

awk -vpat=5 : run awk and set the variable pat to 5.
if(NR==FNR){} : if this is the 1st file.
if($0~pat && !a){a++; next} : if this line matches the value of pat and a is still unset (no match seen yet), increment a by one and skip to the next line, so the matching line itself is not printed.
if(a){print} : if a is set (the pattern has already matched in this file), print the line.
else{ } : if this is not the 1st file (so it's the second pass).
if(!a){print} : if a is still unset, the pattern never matched and we want the whole file, so print every line.
else{exit} : if a is set, we've already printed in the 1st pass so there's no need to reprocess the file.
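
The two-pass version needs a real file, so it cannot read from stdin as the question requires. One way around that, offered here as a sketch rather than as part of the original answer, is a single pass that buffers lines until the pattern shows up: the buffer is discarded on the first match and flushed at the end if nothing matched. Like the perl version, this holds the not-yet-matched part of the input in memory. The function name and the PAT environment variable are illustrative.

```shell
# strip_through_first_match is a hypothetical helper name.
# Usage: some_command | strip_through_first_match 'regex'
strip_through_first_match() {
  PAT="$1" awk '
    found { print; next }                      # after the match: copy lines through
    $0 ~ ENVIRON["PAT"] { found = 1; buf = ""; next }  # first match: drop the buffer and this line
    { buf = buf $0 "\n" }                      # before any match: hold the line back
    END { if (!found) printf "%s", buf }       # no match at all: emit everything
  '
}

printf '%s\n' 1 2 3 4 5 11 12 | strip_through_first_match 5    # prints 11 and 12
printf '%s\n' foo bar | strip_through_first_match 10           # prints foo and bar
```

The pattern travels through ENVIRON rather than -v so that backslashes in it reach awk unchanged.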

Related Solutions

Bash – Using bash variable with escape character in awk to extract lines from file

Several options here:

pat1, pat2 treated as regexps:
```
pat1="A sentence here"
pat2='\*{58}'
export pat1 pat2
awk '$0 ~ ENVIRON["pat1"], $0 ~ ENVIRON["pat2"]'
```
Note that mawk and versions of gawk prior to 4.0.0 do not support the {} extended regular expression operator. For old versions of gawk, you can pass the POSIXLY_CORRECT environment variable to make it recognise it.

Here using the start-condition, end-condition [{action}] approach, but you could do the same with your p flag approach.

pat1, pat2 treated as fixed strings:
```
pat1="A sentence here"
pat2=$(printf '*%.0s' {1..58})
export pat1 pat2
awk 'index($0, ENVIRON["pat1"]), index($0, ENVIRON["pat2"])'
```
Here, index() searches for the needle (the variable content) anywhere in the haystack (the current record (line)), but you could also do a simple full-line comparison:
```
awk '"" $0 == ENVIRON["pat1"], "" $0 == ENVIRON["pat2"]'
```
(the "" is to force a string comparison even in cases where both $0 and ENVIRON["patx"] are numerical).

Avoid using -v to pass data that may contain backslash characters: awk processes C escape sequences (\n, \b, \\...) in such values, so you would need to escape the backslashes. With GNU awk 4.2 or above, values that start with @/ and end in / are also a problem. The same goes for variables passed like awk '...code...' awkvar="$shellvar". Use ENVIRON or ARGV instead.

See this answer to a related question for further details.
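
The ENVIRON route is shown above; the ARGV route can be sketched as follows. The helper name grep_fixed and the variable names are assumptions made for this example. The string is handed over as the first argument, copied out in BEGIN, and then removed from ARGV so that awk does not try to open it as a file.

```shell
# grep_fixed is a hypothetical helper: print lines that contain the
# literal string $1, reading the named files or stdin if none are given.
grep_fixed() {
  needle=$1
  shift
  awk 'BEGIN { needle = ARGV[1]; delete ARGV[1] }
       index($0, needle)' "$needle" "$@"
}

printf 'plain\nback\\slash\n' | grep_fixed 'k\s'   # prints back\slash

# For comparison, -v rewrites escape sequences while ENVIRON does not:
s='a\tb'
awk -v v="$s" 'BEGIN { print v }'            # \t becomes a tab character
S="$s" awk 'BEGIN { print ENVIRON["S"] }'    # printed exactly as typed
```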

SED – Delete All Lines Before Matching One, Including This One

Similar to your "clean solution":

sed -e '1,/HI_THERE/d' input_file

The first line in the file is line 1 - there's no special ^ address because you always know that, while $ is needed for the end because you don't (necessarily) know which line that is.

This does fall over if the matching line is the first line of the file. With GNU sed you can use 0 instead of 1 to deal with that. For POSIX sed and for portability (which seem to be different in this case) it's more complex (see comments below and this follow-up question).
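
One portable way to cope with a match on the first line is to swap sed for awk, which is a different tool from the one used in this answer. A sketch, with HI_THERE as in the example and a made-up function name:

```shell
# delete_through_match is a hypothetical helper.
# f stays 0 until the first line matching HI_THERE has been read, so that
# line and everything before it are skipped; every later line is printed.
delete_through_match() {
  awk 'f; /HI_THERE/ { f = 1 }' "$@"
}

printf '%s\n' HI_THERE a b | delete_through_match     # prints a and b
printf '%s\n' x HI_THERE a b | delete_through_match   # prints a and b
```

As with the sed command, nothing is printed when no line matches.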
