Shell – sed regex for capture group between delimiters

command line, linux, regular expression, sed, shell-script

I have a two-line file that I'm trying to extract some information from, using sed in a bash script.

# File Comment
PrefixForInformation {information to be captured}

I need to get the information between, but not including, the curly braces.
I have the PCRE regex /{(.*)}/ or \s{([^}]*), which seems to work in the online Regex 101 tester, but I can't turn it into a working sed command.

Best Answer

$ sed -n 's/.*{\(.*\)}.*/\1/p' file
information to be captured

How it works

-n

This tells sed not to print anything unless we explicitly ask it to.

s/.*{\(.*\)}.*/\1/p

This substitute command captures as group 1 the text between the curly braces. In sed's default basic regular expressions, \( and \) delimit a group, while a bare { is a literal brace. The whole line is replaced with group 1, denoted \1. The p at the end tells sed to print the result only if a substitution was made.
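Since the goal is to use the value in a bash script, here is a minimal sketch of capturing it into a variable. The temporary file and the variable names sample and info are hypothetical, and [^}]* is swapped in for the greedy .* so that the capture stops at the first closing brace after the opening one.

```shell
# Hypothetical sample file with the two-line layout from the question.
sample=$(mktemp)
printf '%s\n' '# File Comment' 'PrefixForInformation {information to be captured}' > "$sample"

# [^}]* matches any run of characters that are not a closing brace.
info=$(sed -n 's/.*{\([^}]*\)}.*/\1/p' "$sample")
printf '%s\n' "$info"

rm -f "$sample"
```

With the sample above, this should print the text between the braces and nothing for the comment line.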

Related Solutions

Ubuntu – sed regex issue

I don't have any problem with [[:space:]]. Here's a really silly little example showing the mixed-replacement of spaces and tabs:

$ echo -e 'A \t \t B' | sed 's/A[[:space:]]*B/WORKED/'
WORKED

You can also use \s, a GNU sed extension, which is often preferable in long sed expressions because it's much shorter:

$ echo -e 'A \t \t B' | sed 's/A\s*B/WORKED/'
WORKED

Anyway, I think your actual problem is escaping those troublesome single quotes. I find the easiest way is to break out of the single-quoted string, add a double-quoted single quote, and then (if needed) go back into a single-quoted string. Bash will automatically concatenate the adjacent pieces for you.

$ echo 'This is a nice string and this is a single quote:'"'"' Nice?'
This is a nice string and this is a single quote:' Nice?
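As a side note, a backslash-escaped quote outside the single-quoted pieces does the same job. The sketch below compares the two spellings; the variable names a and b are only for illustration.

```shell
# Spelling 1: close the string, add a double-quoted ', reopen the string.
a='single quote:'"'"' Nice?'

# Spelling 2: close the string, add a backslash-escaped ', reopen the string.
b='single quote:'\'' Nice?'

printf '%s\n' "$a"
if [ "$a" = "$b" ]; then
  echo 'both spellings produce the same string'
fi
```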

So all the space we saved with \s is about to get destroyed by this mega-quote situation:

$ echo -e '$RELEASE  \t = '"'"'1234'"'"';' |\
  sed 's/$RELEASE\s*=\s*'"'"'[0-9]*'"'"'\;/REPLACEMENT/'
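One possible way to tame the quoting, sketched under the assumption that a helper variable is acceptable in your script: store the single quote in a variable (q is a hypothetical name) and put the sed script in double quotes. [[:space:]] is used instead of \s so the sketch does not depend on GNU extensions.

```shell
# q holds one literal single quote (a hypothetical helper variable).
q="'"

# Double quotes let ${q} expand; \$ keeps the dollar sign literal for sed.
replaced=$(printf '$RELEASE  \t = %s1234%s;\n' "$q" "$q" |
  sed "s/\$RELEASE[[:space:]]*=[[:space:]]*${q}[0-9]*${q};/REPLACEMENT/")
printf '%s\n' "$replaced"
```

The trade-off is that inside double quotes every $ and backslash meant for sed has to be protected from the shell.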

Of course, there is an argument (because this looks like a PHP script) that if the line starts with $RELEASE followed by spaces and an equals sign, you can just replace the whole line. That's not always true, obviously (the entire app could be one hideous line), but it makes your search and replace more palatable. Note that \s is not special inside a bracket expression, so use [[:space:]] there:

sed 's/$RELEASE[[:space:]=]*.*/REPLACEMENT/'

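And yes, the general sed usage rules apply. Don't feed a file into a stream editor (like sed) and redirect the output back into that same file. Even if it appears to work, you could easily knacker the file, because the shell can truncate it before sed has finished reading it.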

Either use the -i argument (works for sed) or pipe into an application like sponge, which soaks up all of its input before writing the output file:

sed -i '...' file
sed '...' file | sponge file
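If neither -i nor sponge is available, a common fallback is to write to a temporary file and move it into place only when sed succeeds. This is a sketch of that alternative, not one of the two methods above; the file names and contents are hypothetical.

```shell
# Hypothetical file to edit.
f=$(mktemp)
printf '%s\n' 'A   B' > "$f"

# Write the edited text to a second file, then replace the original
# only if sed exited successfully.
tmp=$(mktemp)
sed 's/A[[:space:]]*B/WORKED/' "$f" > "$tmp" && mv "$tmp" "$f"

result=$(cat "$f")
printf '%s\n' "$result"
rm -f "$f"
```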

PCRE-regex – Use grep to exclude a capturing group

grep's name comes from the ed command g/re/p. Its primary purpose is to print the lines that match a regexp. Editing the content of those lines is not its role; you have sed (the stream editor) or awk for that.

Now, some grep implementations, starting with GNU grep, added a -o option to print the matched portion of each line (what is matched by the regexp, not its capture groups). Some grep implementations, like GNU's again (with -P) or pcregrep, also support PCREs for their regexps.

pcregrep actually added a -o<n> option to print the content of a capture group. So you could do:

pcregrep -o1 -o2 --om-separator=' ' '\.zoo\.(\d+).*:\s+(.*)'

But here, the obvious standard solution is to use sed:

sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p'

Or if you want perl regexps, use perl:

perl -lne 'print "$1 $2" if /\.zoo\.(\d+).*:\s+(.*)/'
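To see what the sed command does, here is a small sketch run against a made-up input line. The original question's data is not shown here, so the contents of line are an assumption chosen to fit the pattern.

```shell
# Hypothetical input line; the original question's data is not shown here.
line='foo.zoo.2.tar: 0.45654343'

# Everything up to and including ":<spaces>" is replaced by the captured
# digits plus a space; the rest of the line is left in place.
pair=$(printf '%s\n' "$line" |
  sed -n 's/^.*\.zoo\.\([0-9]\{1,\}\).*:[[:space:]]\{1,\}/\1 /p')
printf '%s\n' "$pair"
```

For that input the command should print the number after .zoo. and the text after the colon, separated by a space.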

With GNU grep, if you don't mind the matches to appear on different lines, you can do:

$ grep -Po '\.zoo\.\K\d+|:\s+\K.*' < file
2
0.45654343

Note that while \K resets the start of the matched portion, that doesn't mean you can get away with the two parts of the alternation overlapping.

grep -Po '\.zoo\.(\K\d+|.*:\s+\K.*)'

would not work, just like echo foobar | grep -Po 'foo|foob' wouldn't work (at printing both foo and foob). foo|foob first matches foo and then grep looks for potential other matches in the input after the foo, so starting at the b of bar, so can't find any more after that.

Above with grep -Po '\.zoo\.\K\d+|:\s+\K.*', we only look for :<spaces><anything> in the second part of the alternation. That does match in the part that is after .zoo.<digits> but that also means it would find those :<spaces><anything> anywhere in the input, not only when they follow .zoo.<digits>.

There is a way to work around that though, using another PCRE special operator: \G. \G matches at the start of the subject. For a single match, that's equivalent to ^, but with multiple matches (think of sed/perl's g flag in s/.../.../g) like with -o where grep tries to find all the matches in the line, that also matches after the end of the previous match. So if you make it:

grep -Po '\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

Here (?!^) is a negative look-ahead that means "not at the beginning of the line", so \G will only match after a previous successful (non-empty) match. That way .*:\s+\K.* will only match if it follows a previous successful match, and that can only be the .zoo.<digits> one, since the other part of the alternation matches until the end of the line.

On an input like:

.zoo.1.zoo.2 tar: blah

That would output:

1
2
blah

If you did not want that, though, you'd also want the first part of the alternation to match only at the beginning of the line. Something like:

grep -Po '^.*?\.zoo\.\K\d+|(?!^)\G.*:\s+\K.*'

That still outputs 2 on an input like ".zoo.2 no colon character" or ".zoo.2 blah:". You could work around that with a look-ahead operator in the first part of the alternation, and by looking for at least one non-space after :<spaces> (also using $ to avoid issues with bytes that don't form valid characters):

grep -Po '^.*?\.zoo\.\K\d+(?=.*:\s+\S.*$)|(?!^)\G.*:\s+\K\S.*$'

You'd probably need a few pages of comments to explain that regexp, so I would still go for the straightforward sed/perl solutions...
