Bash – Using bash variable with escape character in awk to extract lines from file

awk bash

I am writing a bash script (just learning bash) to extract some lines from a file based on two patterns. The first pattern is just a sentence ending in a colon. The second pattern is a * repeated N (in this case 58) times.

An example file:

lines I do not want
lines I do not want
lines I do not want

A sentence here:
********************************************************
lines I want
lines I want
lines I want
**********************************************************

lines I do not want
lines I do not want
lines I do not want

Desired output:

A sentence here:
********************************************************
lines I want
lines I want
lines I want
**********************************************************

I can get the script to work if I explicitly type out A sentence here and \* 58 times within the call to awk, but for cleanliness and readability I would prefer to do something like the following:

pat1="A sentence here"
pat2=`printf -- '\*%.s' {1..58} ; echo`
pat2=${pat2//\\/\\\\}
awk -v pat1="${pat1}" -v pat2="${pat2}" '/{pat1}/ {p=1}; p; /{pat2}/ {p=0}' $1

Here, the first positional parameter ($1) is the input file. The above code returns nothing. I initially tried it without the substitution on pat2, but got the warning:

awk: warning: escape sequence `\*' treated as plain `*'

I will have to run this command thousands of times and would ideally like a solution that is both clean and efficient. I'm not tied to using awk at all.

Edit:

I just noticed that even when I manually type the patterns into awk, I still receive the warning message. I am likely not passing the variables to awk correctly.

Best Answer

Several options here:

  • pat1, pat2 treated as regexps:

    pat1="A sentence here"
    pat2='\*{58}'
    export pat1 pat2
    awk '$0 ~ ENVIRON["pat1"], $0 ~ ENVIRON["pat2"]'
    

    Note that mawk and versions of gawk prior to 4.0.0 do not support the {} extended regular expression operator. For old versions of gawk, you can set the POSIXLY_CORRECT environment variable to make gawk recognise it.

    This uses awk's range pattern form, start-condition, end-condition [{action}], but you could do the same with your p flag approach.

  • pat1, pat2 treated as fixed strings:

    pat1="A sentence here"
    pat2=$(printf '*%.0s' {1..58})
    export pat1 pat2
    awk 'index($0, ENVIRON["pat1"]), index($0, ENVIRON["pat2"])'
    

    Here, index() searches for the needle (the variable content) anywhere in the haystack (the current record (line)), but you could also do a simple full-line comparison:

    awk '"" $0 == ENVIRON["pat1"], "" $0 == ENVIRON["pat2"]'
    

    (the "" is to force a string comparison even in cases where both $0 and ENVIRON["patx"] are numerical).
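
As a rough end-to-end sketch of both options, the script below builds a small input file modelled on the one in the question and runs the regexp and the fixed-string variants over it. The helper stars, the temporary file, and the 56/58 asterisk counts on the opening and closing delimiter lines are assumptions made for illustration. Two techniques are swapped in so the sketch also runs under plain sh and under awks without interval support: printf with tr replaces the {1..58} brace expansion, and the regexp is built by repeating \* 58 times instead of writing \*{58}.

```shell
#!/bin/sh
# Hypothetical helper: print N asterisks (portable stand-in for {1..N}).
stars() { printf "%${1}s" '' | tr ' ' '*'; }

# Sample input modelled on the question; the 56/58 counts are assumptions.
sample=$(mktemp)
{
  echo 'lines I do not want'
  echo
  echo 'A sentence here:'
  stars 56; echo
  echo 'lines I want'
  stars 58; echo
  echo
  echo 'lines I do not want'
} > "$sample"

# Option 1: regexp matching. The end pattern is "\*" repeated 58 times.
pat1='A sentence here'
pat2=$(stars 58 | sed 's/\*/\\*/g')
export pat1 pat2
regex_out=$(awk '$0 ~ ENVIRON["pat1"], $0 ~ ENVIRON["pat2"]' "$sample")

# Option 2: fixed-string matching with index(); no escaping needed.
pat2=$(stars 58)
export pat2
fixed_out=$(awk 'index($0, ENVIRON["pat1"]), index($0, ENVIRON["pat2"])' "$sample")

printf '%s\n' "$regex_out"
rm -f "$sample"
```

Both variants should select the same block here. Note that a range ends at the first line matching the end condition, so if the opening delimiter line also contained 58 or more asterisks, the range would stop there.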

Avoid using -v to pass data that may contain backslash characters: awk performs C escape sequence processing (\n, \b, \\...) on such values, so you would need to escape the backslashes. With GNU awk 4.2 or above, values that start with @/ and end in / are also a problem. The same goes for variables passed like awk '...code...' awkvar="$shellvar". Use ENVIRON or ARGV instead.
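
A minimal sketch of the difference, assuming a hypothetical test value made of the four characters a, backslash, t, b. It passes the same value three ways and reports the length awk sees in each case; the variable names val, via_v, via_env and via_argv are illustrative only.

```shell
#!/bin/sh
# Hypothetical test value: the four characters a, backslash, t, b.
val='a\tb'

# -v processes escape sequences: "\t" becomes a single tab character.
via_v=$(awk -v s="$val" 'BEGIN { print length(s) }')

# ENVIRON hands the value over verbatim.
via_env=$(s="$val" awk 'BEGIN { print length(ENVIRON["s"]) }')

# ARGV also keeps it verbatim; delete the entry so awk does not
# try to open it as an input file.
via_argv=$(awk 'BEGIN { s = ARGV[1]; delete ARGV[1]; print length(s) }' "$val")

printf 'v=%s ENVIRON=%s ARGV=%s\n' "$via_v" "$via_env" "$via_argv"
```

With -v the value arrives one character shorter because the backslash and t collapse into a tab, while the ENVIRON and ARGV routes keep all four characters.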

See this answer to a related question for further details.
