Text Processing – Finding Text Between Two Specific Characters or Strings

regular expression, text processing

Say I have lines like this:

*[234]*
*[23]*
*[1453]*

where * represents any string (except a string of the form [number]). How can I parse these lines with a command-line utility and extract the number between the brackets?

More generally, which of these tools (cut, sed, grep, or awk) would be appropriate for such a task?

Best Answer

If your grep supports the -o option (GNU grep does, as do the BSD implementations; POSIX does not require it), you can search for a regex and output only the matching part. Without -o, grep can only show the whole line. If there are several matches on one line, they are printed on separate lines.

grep -o '\[[0-9]*\]'
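As a quick check of the behaviour described above, here is a minimal sketch run against made-up input. The variable names and the sample lines are hypothetical, and the last line deliberately contains two bracketed numbers to show that each match lands on its own output line.

```shell
# Hypothetical sample input; the text around the brackets is arbitrary.
sample='foo[234]bar
baz[23]
[1453]qux[7]'

# -o prints every match on its own line, brackets included.
bracketed=$(printf '%s\n' "$sample" | grep -o '\[[0-9]*\]')
printf '%s\n' "$bracketed"
# prints:
# [234]
# [23]
# [1453]
# [7]
```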

If you only want the digits and not the brackets, it's a little harder; you need to use a zero-width assertion: a regexp that matches the empty string, but only if it is preceded (or followed, as the case may be) by a bracket. In grep, zero-width assertions are only available in Perl-compatible syntax (-P), which requires a GNU grep built with PCRE support.

grep -P -o '(?<=\[)[0-9]*(?=\])'
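The sketch below runs the lookaround version on hypothetical input and, as an alternative that is not part of the original answer, strips the brackets afterwards with tr for systems where grep has -o but not -P. It assumes the captured text contains no brackets other than the delimiters.

```shell
# Hypothetical sample input.
sample='foo[234]bar
baz[23]
[1453]qux[7]'

# Lookbehind/lookahead: the brackets must be present but are not part of the match.
digits=$(printf '%s\n' "$sample" | grep -P -o '(?<=\[)[0-9]*(?=\])')

# Assumed fallback without -P: match with the brackets, then delete them.
digits_tr=$(printf '%s\n' "$sample" | grep -o '\[[0-9]*\]' | tr -d '[]')

printf '%s\n' "$digits"
# prints:
# 234
# 23
# 1453
# 7
```

Both variables should hold the same four lines for this input.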

With sed, you need to turn off default printing with -n, match the whole line, and retain only the matching part. If there are several possible matches on one line, only the last match is printed, because the leading .* is greedy. See "Extracting a regex matched with 'sed' without printing the surrounding characters" for more details on using sed here.

sed -n 's/^.*\(\[[0-9]*\]\).*/\1/p'

or if you only want the digits and not the brackets:

sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
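To illustrate the last-match behaviour of the sed commands, here is a small sketch on hypothetical input. The first line has two bracketed numbers, and the second has none, so it produces no output at all.

```shell
# Hypothetical input: two matches on line 1, none on line 2, one on line 3.
sample='a[1]b[22]c
no brackets here
x[333]'

# The greedy ^.* swallows as much as it can, so \1 captures the LAST number.
last=$(printf '%s\n' "$sample" | sed -n 's/^.*\[\([0-9]*\)\].*/\1/p')
printf '%s\n' "$last"
# prints:
# 22
# 333
```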

Without grep -o, Perl is the tool of choice here if you want something that's both simple and comprehensible. On every line (-n), if the line contains a match for \[[0-9]*\], print that match ($&) followed by a newline (-l). Only the first match on each line is printed.

perl -l -ne '/\[[0-9]*\]/ and print $&'

If you only want the digits, put parentheses in the regex to delimit a group, and print only that group.

perl -l -ne '/\[([0-9]*)\]/ and print $1'

P.S. If you want to require at least one digit between the brackets, change [0-9]* to [0-9][0-9]*, or to [0-9]+ in Perl syntax.
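The difference only shows up on empty brackets. A minimal sketch on hypothetical input, using the grep -o form from earlier:

```shell
# Hypothetical input: one empty pair of brackets, one with digits.
sample='x[]y
x[42]y'

# [0-9]* allows zero digits, so the empty brackets match too.
star=$(printf '%s\n' "$sample" | grep -o '\[[0-9]*\]')

# [0-9][0-9]* requires at least one digit.
plus=$(printf '%s\n' "$sample" | grep -o '\[[0-9][0-9]*\]')

printf '%s\n' "$star"
# prints:
# []
# [42]
printf '%s\n' "$plus"
# prints:
# [42]
```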
