Text Processing – Finding Text Between Two Specific Characters or Strings

regular expression, text processing

Say I have lines like this:

*[234]*
*[23]*
*[1453]*

where * represents any string (except a string of the form [number]). How can I parse these lines with a command line utility and extract the number between brackets?

More generally, which of these tools cut, sed, grep or awk would be appropriate for such task?

Best Answer

If your grep has the -o option (GNU grep and the BSD greps do), you can use it to search for a regex and output only the matching part. (A grep without -o can only show the whole line.) If there are several matches on one line, they are printed on separate lines.

grep -o '\[[0-9]*\]'

If you only want the digits and not the brackets, it's a little harder; you need to use a zero-width assertion: a regexp that matches the empty string, but only if it is preceded, or followed as the case may be, by a bracket. These lookaround assertions are only available in Perl-compatible syntax, which GNU grep enables with -P.

grep -P -o '(?<=\[)[0-9]*(?=\])'
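If your grep has -o but not -P, one possible workaround (a sketch, not part of the original answer) is to keep the brackets in the match and strip them afterwards with tr. The function name extract_bracketed_numbers and the sample lines are made up for illustration.

```shell
# Hypothetical helper: print every [digits] group from stdin, without the brackets.
extract_bracketed_numbers() {
  # grep -o prints each match on its own line; tr then deletes "[" and "]".
  grep -o '\[[0-9]*\]' | tr -d '[]'
}

# Made-up sample input: the second line has two matches, the third has none.
printf '%s\n' 'foo[234]bar' 'x[23]y[7]z' 'none here' | extract_bracketed_numbers
```

This should print 234, 23 and 7, one per line. Since the regex only ever matches brackets and digits, deleting the brackets cannot damage the numbers.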

With sed, you need to turn off printing with -n, and match the whole line and retain only the matching part. If there are several possible matches on one line, only the last match is printed. See Extracting a regex matched with 'sed' without printing the surrounding characters for more details on using sed here.

sed -n 's/^.*\(\[[0-9]*\]\).*/\1/p'

or if you only want the digits and not the brackets:

sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
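To see the "last match" behaviour concretely, here is a small sketch on a made-up line with two bracketed numbers. The second function is an assumption-laden variant that looks at the first bracket instead; it only prints something when the first "[" on the line opens a [digits] group. Both function names are hypothetical.

```shell
# Greedy ^.* swallows as much as possible, so the LAST [digits] group wins.
last_bracketed_number() {
  sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
}

# [^[]* cannot cross a "[", so only the FIRST bracket on the line is considered.
# Assumption: that first "[" opens a [digits] group; otherwise nothing is printed.
first_bracketed_number() {
  sed -n 's/^[^[]*\[\([0-9]*\)\].*/\1/p'
}

printf '%s\n' 'x[23]y[7]z' | last_bracketed_number    # prints 7
printf '%s\n' 'x[23]y[7]z' | first_bracketed_number   # prints 23
```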

Without grep -o, Perl is the tool of choice here if you want something that's both simple and comprehensible. On every line (-n), if the line contains a match for \[[0-9]*\], then print that match ($&) and a newline (-l).

perl -l -ne '/\[[0-9]*\]/ and print $&'

If you only want the digits, put parentheses in the regex to delimit a group, and print only that group.

perl -l -ne '/\[([0-9]*)\]/ and print $1'

P.S. If you only want to require one or more digits between the brackets, change [0-9]* to [0-9][0-9]*, or to [0-9]+ in Perl.
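The question also mentions awk, which the answer does not cover. A minimal sketch, assuming you want the first bracketed number on each line and at least one digit between the brackets: match() sets RSTART and RLENGTH to the position and length of the leftmost match, and substr() then trims one character from each end. The function name is hypothetical.

```shell
# First [digits] group per line, brackets removed; lines without a match print nothing.
first_bracketed_number_awk() {
  awk 'match($0, /\[[0-9]+\]/) { print substr($0, RSTART + 1, RLENGTH - 2) }'
}

# Made-up sample input: "x[]y" is skipped because + requires at least one digit.
printf '%s\n' 'foo[234]bar' 'x[]y' 'x[23]y[7]z' | first_bracketed_number_awk
```

This should print 234 and 23. Changing + to * would also accept empty brackets and print an empty line for them.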

Related Solutions

Grep – How to Find Text Between Two Strings

With grep -P/pcregrep, using a positive look-behind and a positive look-ahead:

grep -P -o '(?<=STRING1).*?(?=STRING2)' infile

In your case, replace STRING1 with filename- and STRING2 with \.tar\.gz

If you don't have access to pcregrep and/or if your grep doesn't support -P you can do this with your favourite text processing tool. Here's a portable way with ed that gives you the same output:

ed -s infile <<\IN
g/STRING1/s//\
&/g
v/STRING1.*STRING2/d
,s/STRING1//
,s/STRING2.*//
,p
IN

How it works: a newline is inserted before each STRING1 occurrence (so there is now at most one occurrence per line); then all lines not matching STRING1.*STRING2 are deleted; on the remaining lines, only the text between STRING1 and STRING2 is kept, and the result is printed.
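If neither grep -P nor ed is convenient, the same kind of extraction can be sketched in awk. This is a swapped-in technique, not the ed method above: it treats both strings literally (via index(), not as regexes) and resumes searching after each end string, so it may differ from the grep -P command in edge cases such as overlapping delimiters. The function name between and the sample line are made up.

```shell
# Hypothetical helper: between START END
# Prints each piece of text found between START and the next END on every input line.
# Note: awk -v interprets backslash escapes in the values, so avoid backslashes here.
between() {
  awk -v s="$1" -v e="$2" '{
    line = $0
    while ((i = index(line, s)) > 0) {
      rest = substr(line, i + length(s))    # text after the start string
      j = index(rest, e)
      if (j == 0) break                     # no end string left on this line
      print substr(rest, 1, j - 1)          # the text in between
      line = substr(rest, j + length(e))    # continue after the end string
    }
  }'
}

# Made-up sample input with two occurrences on one line.
printf '%s\n' 'a filename-one.tar.gz b filename-two.tar.gz' | between 'filename-' '.tar.gz'
```

This should print one and two on separate lines.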

How to Get Last Occurrence of Lines Between Two Patterns

You can always do:

tac < filename | sed '/EndPattern/,$!d;/StartPattern/q' | tac

If your system doesn't have GNU tac, you may be able to use tail -r instead.
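One way to write the pipeline so that it works with either tool is to pick the reverser at run time. This is a sketch under the assumption that one of tac or tail -r exists; the function names and the sample input are made up.

```shell
# Reverse the lines of stdin with tac if it exists, else fall back to tail -r.
reverse_lines() {
  if command -v tac >/dev/null 2>&1; then
    tac
  else
    tail -r
  fi
}

# Last StartPattern..EndPattern block of stdin, using the reverse/sed/reverse idea.
last_block() {
  reverse_lines | sed '/EndPattern/,$!d;/StartPattern/q' | reverse_lines
}

# Made-up sample input with two blocks and a trailing line.
printf '%s\n' 'StartPattern 1' a EndPattern 'StartPattern 2' b EndPattern c | last_block
```

This should print the second block only: StartPattern 2, b, EndPattern.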

You can also do it like:

awk '
  inside {
    text = text $0 RS
    if (/EndPattern/) inside=0
    next
  }
  /StartPattern/ {
    inside = 1
    text = $0 RS
  }
  END {printf "%s", text}' < filename

But that means reading the whole file.

Note that it may give different results from the tac-based command if there's another StartPattern between a StartPattern and the next EndPattern, if the last StartPattern has no matching EndPattern, or if some lines match both StartPattern and EndPattern.

awk '
  /StartPattern/ {
    inside = 1
    text = ""
  }
  inside {text = text $0 RS}
  /EndPattern/ {inside = 0} 
  END {printf "%s", text}' < filename

The version above behaves more like the tac+sed+tac approach (except in the unclosed trailing StartPattern case).
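The difference between the two awk programs shows up when a second StartPattern appears before the first EndPattern. The sketch below wraps each program in a function (hypothetical names) and runs both on a made-up input.

```shell
# Variant 1: once inside a block, a nested StartPattern line is simply appended.
last_block_v1() {
  awk '
    inside {
      text = text $0 RS
      if (/EndPattern/) inside = 0
      next
    }
    /StartPattern/ {
      inside = 1
      text = $0 RS
    }
    END { printf "%s", text }'
}

# Variant 2: every StartPattern line restarts the collected text.
last_block_v2() {
  awk '
    /StartPattern/ {
      inside = 1
      text = ""
    }
    inside { text = text $0 RS }
    /EndPattern/ { inside = 0 }
    END { printf "%s", text }'
}

# Made-up input: "StartPattern 2" occurs before the first EndPattern.
sample='StartPattern 1
a
StartPattern 2
b
EndPattern
c'

printf '%s\n' "$sample" | last_block_v1   # output starts at "StartPattern 1"
printf '%s\n' "$sample" | last_block_v2   # output starts at "StartPattern 2"
```

On this input the second variant prints StartPattern 2, b, EndPattern, which is also what the tac+sed+tac pipeline prints.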

That last one seems to be the closest to your edited requirements. Adding the warning is then simply:

awk '
  /StartPattern/ {
    inside = 1
    text = ""
  }
  inside {text = text $0 RS}
  /EndPattern/ {inside = 0} 
  END {
    printf "%s", text
    if (inside)
      print "Warning: EOF reached without seeing the end pattern" > "/dev/stderr"
  }' < filename

To avoid reading the whole file:

tac < filename | awk '
  /StartPattern/ {
    printf "%s", $0 RS text
    if (!inside)
      print "Warning: EOF reached without seeing the end pattern" > "/dev/stderr"
    exit
  }
  /EndPattern/ {inside = 1; text = ""}
  {text = $0 RS text}'

Portability note: for /dev/stderr, you need either a system with such a special file (beware that on Linux, if stderr is open on a seekable file, the text is written at the beginning of the file instead of at the current position within the file) or an awk implementation that emulates it, like gawk, mawk or busybox awk (those work around the Linux issue mentioned above).

On other systems, you can replace print ... > "/dev/stderr" with print ... | "cat>&2".
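As a sketch of that substitution, the function below (hypothetical name, reduced to the warning logic only) pipes the message to cat, whose stdout is redirected to stderr, so no /dev/stderr file is needed.

```shell
# Warn on stderr if the input ends inside an unclosed StartPattern block.
# The message goes through: print ... | "cat >&2" instead of > "/dev/stderr".
warn_if_unclosed() {
  awk '
    /StartPattern/ { inside = 1 }
    /EndPattern/   { inside = 0 }
    END {
      if (inside)
        print "Warning: EOF reached without seeing the end pattern" | "cat >&2"
    }'
}

# Made-up input with an unclosed block: the warning appears on stderr, stdout stays empty.
printf '%s\n' 'StartPattern' 'x' | warn_if_unclosed
```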
