How to get the text between two words specified by their indices

text processing

Using awk, I can print the words at given indices as follows.

$ echo "The quick brown fox jumps over the lazy dog" | awk  '{print $3, $7}'
brown the

But I also want to get the text between the specified words, "brown" and "the". So I want the output to look like this.

brown fox jumps over the

It's not necessary to use awk specifically, but the indexing and tokenization of words should match that of awk to keep consistency with the other parts in my shell script that use awk.

I thought about printing the words one by one from the first index to the last index, but that doesn't preserve the original whitespace between words.

To put this in a complicated but more accurate way, I want to get the text that begins at the beginning of some word specified by an index and ends at the end of another word specified by another index. How can I achieve that (preferably without bash loops)?

Best Answer

With gawk, you can use the four-argument form of the split() function, which stores the fields in one array and the separators between them in another:

$ echo "The quick brown fox   jumps over the lazy dog" | gawk '{ n = split($0, a, " ", s); last = (n < 7 ? n : 7); for (i = 3; i <= last; i++) printf "%s%s", a[i], (i < last ? s[i] : "\n") }'
brown fox   jumps over the
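
If gawk is not available, the same idea can probably be approximated in POSIX awk by walking the line with match(). This is only a sketch: the between function name and its two index arguments are made up for illustration, and it assumes awk's default field splitting on spaces and tabs.

```shell
# between FIRST LAST (hypothetical helper): print the text from field FIRST
# through field LAST of each input line, keeping the whitespace in between.
between() {
  awk -v first="$1" -v last="$2" '
  {
    rest = $0; out = ""; n = 0
    # Each match() finds the next run of non-blank characters, i.e. the next field.
    while (match(rest, /[^ \t]+/)) {
      n++
      if (n > last) break
      if (n > first)  out = out substr(rest, 1, RSTART - 1)   # separator before this field
      if (n >= first) out = out substr(rest, RSTART, RLENGTH) # the field itself
      rest = substr(rest, RSTART + RLENGTH)
    }
    if (out != "") print out
  }'
}

printf '%s\n' 'The quick brown fox   jumps over the lazy dog' | between 3 7
# brown fox   jumps over the
```

Lines with fewer than FIRST fields print nothing, and a LAST beyond the end of the line simply stops at the final field.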

Related Solutions

Get text between a word and the last line

Including the last line you'd do:

sed -n '/word/,$p'

That selects the range from the first line containing word through the last line and prints every line in it.

Not including the last line:

sed '/word/,$!d;$d'

...which deletes every line outside that range and then deletes the last line.
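
To see the difference between the two commands, here is a small run on made-up input. The sample function is only there to produce test data.

```shell
# sample: hypothetical five-line input with two lines containing "word"
sample() { printf '%s\n' alpha 'first word' beta 'second word' gamma; }

sample | sed -n '/word/,$p'    # first match through the last line (4 lines)
echo ---
sample | sed '/word/,$!d;$d'   # same range without the last line (3 lines)
```

The first command should print everything from "first word" through "gamma", the second the same lines without "gamma".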

And to get from only the last match to the last line you have to try a little harder:

sed -e :n -e '/\n.*word/D;N;$q;bn'

It loops: it never completes the normal sed line cycle, but instead appends the next input line to the pattern space and branches back to do so again. Whenever the pattern space holds at least two lines and a line after the first one matches word, D deletes the oldest line and restarts the script, so the buffer is trimmed down until it starts at the line that matches word. On the last line it just quits, which breaks the loop and prints the buffer. So what gets printed is everything from the last line containing word to the last line of input. (If the last line itself contains word, $q fires before that match is checked, so the output starts at the previous match.)

Hmmm... maybe I made that harder than it has to be:

sed 'H;$x;/word/h;$!d'

With that one every line is appended to hold space. But lines matching word then overwrite hold space. Every line in pattern space that is not the last line is deleted. And on the last line, just after it is appended to hold space, the hold and pattern spaces are exchanged (in case the last line also contains word) and everything from the last time word overwrote hold space is printed.
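
As a quick sanity check, both variants can be run on the same made-up input. The sample function is again just illustrative test data, and the comparison assumes a sed (such as GNU sed) in which . also matches an embedded newline.

```shell
# sample: hypothetical input whose last line does not contain "word"
sample() { printf '%s\n' alpha 'first word' beta 'second word' gamma; }

sample | sed -e :n -e '/\n.*word/D;N;$q;bn'   # looping variant
echo ---
sample | sed 'H;$x;/word/h;$!d'               # hold-space variant
```

On this input both should print "second word" followed by "gamma".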

How to Get Last Occurrence of Lines Between Two Patterns

You can always do:

tac < filename | sed '/EndPattern/,$!d;/StartPattern/q' | tac

If your system doesn't have GNU tac, you may be able to use tail -r instead.
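
For illustration, here is that pipeline on a made-up stream with two blocks and some noise around them. The blocks function is hypothetical test data, and the example assumes GNU tac is installed.

```shell
# blocks: hypothetical input with two Start/End blocks
blocks() {
  printf '%s\n' StartPattern one EndPattern noise StartPattern two EndPattern tail
}

# Reverse, keep from the first EndPattern down to the next StartPattern, reverse back.
blocks | tac | sed '/EndPattern/,$!d;/StartPattern/q' | tac
```

Only the second block (StartPattern, two, EndPattern) should come out.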

You can also do it like:

awk '
  inside {
    text = text $0 RS
    if (/EndPattern/) inside=0
    next
  }
  /StartPattern/ {
    inside = 1
    text = $0 RS
  }
  END {printf "%s", text}' < filename

But that means reading the whole file.

Note that this awk version may give different results from the tac approach if there's another StartPattern between a StartPattern and the next EndPattern, if the last StartPattern has no closing EndPattern, or if some lines match both StartPattern and EndPattern.

awk '
  /StartPattern/ {
    inside = 1
    text = ""
  }
  inside {text = text $0 RS}
  /EndPattern/ {inside = 0} 
  END {printf "%s", text}' < filename

This variant behaves more like the tac+sed+tac approach (except for the unclosed trailing StartPattern case).
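
The nested StartPattern case is probably the easiest way to see how the two awk variants differ. The function names and the sample input below are made up for the demonstration; the awk programs are the two shown above, condensed.

```shell
# nested: hypothetical input with a StartPattern inside an open block
nested() { printf '%s\n' StartPattern a StartPattern b EndPattern; }

# First variant: a StartPattern seen while inside does not restart the capture.
first_variant() {
  awk '
    inside { text = text $0 RS; if (/EndPattern/) inside = 0; next }
    /StartPattern/ { inside = 1; text = $0 RS }
    END { printf "%s", text }'
}

# Second variant: every StartPattern restarts the capture.
second_variant() {
  awk '
    /StartPattern/ { inside = 1; text = "" }
    inside { text = text $0 RS }
    /EndPattern/ { inside = 0 }
    END { printf "%s", text }'
}

nested | first_variant    # all five lines
echo ---
nested | second_variant   # only the last three lines
```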

That last one seems to be the closest to your edited requirements. To add the warning would simply be:

awk '
  /StartPattern/ {
    inside = 1
    text = ""
  }
  inside {text = text $0 RS}
  /EndPattern/ {inside = 0} 
  END {
    printf "%s", text
    if (inside)
      print "Warning: EOF reached without seeing the end pattern" > "/dev/stderr"
  }' < filename

To avoid reading the whole file:

tac < filename | awk '
  /StartPattern/ {
    printf "%s", $0 RS text
    if (!inside)
      print "Warning: EOF reached without seeing the end pattern" > "/dev/stderr"
    exit
  }
  /EndPattern/ {inside = 1; text = ""}
  {text = $0 RS text}'

Portability note: print > "/dev/stderr" needs either a system that provides such a special file, or an awk implementation that emulates it, such as gawk, mawk or busybox awk. Beware that on Linux, if stderr is open on a seekable file, writing through the real /dev/stderr puts the text at the beginning of the file instead of at the current position within the file; the awk implementations that emulate /dev/stderr work around that issue.

On other systems, you can replace print ... > "/dev/stderr" with print ... | "cat>&2".
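
As a sketch of that substitution, here is the warning variant with the message piped to cat instead of written to /dev/stderr. The last_block wrapper name and the sample input are assumptions made for the example.

```shell
# last_block (hypothetical wrapper): print the last StartPattern block read
# from stdin and warn on stderr if it was never closed.
last_block() {
  awk '
    /StartPattern/ { inside = 1; text = "" }
    inside         { text = text $0 RS }
    /EndPattern/   { inside = 0 }
    END {
      printf "%s", text
      if (inside)
        print "Warning: EOF reached without seeing the end pattern" | "cat >&2"
    }'
}

# The second block is never closed, so the warning should appear on stderr.
printf '%s\n' StartPattern one EndPattern StartPattern two | last_block
```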
