Shell – Perl regex get word between a pattern

perl, regular expression, shell-script

I have a working perl regex using grep. I am trying to understand how it works.

Here is the command.

grep -oP '(?<=location>)[^<]+' testFile1.xml

Here are the contents of testFile1.xml

<con:location>C:/test/file1.txt</con:location></con:dataFile></con:dataFiles></con:groupFile>

And this is the result

C:/test/file1.txt

I am trying to understand the regex, i.e. this part (?<=location>)[^<]+

Best Answer

(?<=...) is a look-behind PCRE operator. By itself, it doesn't match anything but acts as a condition (that what's on the left matches ...).

(?<=X)Y matches Y provided that what's on the left matches X. In blahYfooXYbar, it matches the second Y; the X is not part of what is matched. The (?<=X) itself matches the zero-width (imaginary) spot just before that Y, as illustrated here:

$ echo X-RAY THE FOX | perl -lpe 's/(?<=X)/<there>/g'
X<there>-RAY THE FOX<there>

Because grep with -o prints only the matched portion, this is a way to make it print just what comes after location> (here, what matches [^<]+: one or more (+) non-< characters ([^<]), so everything up to, but not including, the next < character or the end of the line, provided it is not empty).
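To make the effect of the look-behind visible, here is a small sketch that runs the pattern with and without it. The sample line is a shortened version of the one in testFile1.xml, and the variable names are only illustrative; it assumes a GNU grep built with PCRE support.

```shell
# Shortened sample line modeled on testFile1.xml from the question.
line='<con:location>C:/test/file1.txt</con:location>'

# Without the look-behind, "location>" is part of the match, so -o prints it too.
with_prefix=$(printf '%s\n' "$line" | grep -oP 'location>[^<]+')

# With the look-behind, "location>" is only a condition and is left out of the match.
value_only=$(printf '%s\n' "$line" | grep -oP '(?<=location>)[^<]+')

printf '%s\n' "$with_prefix"   # location>C:/test/file1.txt
printf '%s\n' "$value_only"    # C:/test/file1.txt
```

The closing tag also contains location>, but nothing follows it on the line, so [^<]+ cannot match there and only one result is printed.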

Another approach is to use \K (in newer versions of PCRE) to reset the start of the matched portion:

grep -Po 'location>\K[^<]+'
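As a rough comparison, the sketch below runs the look-behind form and the \K form on the same shortened sample line. The third pattern (loc\w*>) is a made-up variation, added only to show that the text before \K is an ordinary pattern and may vary in length, which a look-behind in PCRE generally does not allow. It assumes a GNU grep built with PCRE support.

```shell
# Shortened sample line modeled on testFile1.xml from the question.
line='<con:location>C:/test/file1.txt</con:location>'

# Look-behind: "location>" must precede the match but is not part of it.
lookbehind=$(printf '%s\n' "$line" | grep -oP '(?<=location>)[^<]+')

# \K: "location>" is matched, then the reported start of the match is reset.
reset_start=$(printf '%s\n' "$line" | grep -oP 'location>\K[^<]+')

# Hypothetical variation: a variable-length prefix in front of \K.
loose_prefix=$(printf '%s\n' "$line" | grep -oP 'loc\w*>\K[^<]+')

printf '%s\n' "$lookbehind" "$reset_start" "$loose_prefix"
```

All three commands print C:/test/file1.txt for this input.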

Note that -P and -o are GNU extensions. With recent versions (8.11 or later) of pcregrep (another grep implementation that uses PCRE), you can also do:

pcregrep -o1 'location>([^<]+)'

(-o1 prints what's captured by the first (here the only) (...) group.)

Related Solutions

Bash – Forcing Bash to use Perl RegEx Engine

Bash doesn't support a method for you to do this at this time. You're left with the following options:

1. Use Perl
2. Use grep [-P|--perl-regexp]
3. Use Bash functionality to code it
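For the last option, one possible sketch uses only shell parameter expansion to pull out the text between two delimiters. The input string and the variable names (input, open, close, tmp, inner) are illustrative assumptions, not something the shell prescribes.

```shell
# Illustrative input: the wanted text sits between "BEGIN `" and "` END".
input='BEGIN `helloworld` END'
open='BEGIN `'
close='` END'

# Remove the shortest prefix that ends with the opening delimiter.
tmp=${input#*"$open"}

# Remove the shortest suffix that starts with the closing delimiter.
inner=${tmp%"$close"*}

printf '%s\n' "$inner"   # helloworld
```

Quoting "$open" and "$close" inside the expansions makes the shell treat them as literal text rather than glob patterns. This form needs no regular expression engine at all, but it only handles fixed delimiters.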

I would go with #2 and use grep to get the functionality I want. In place of a back-reference, you can use look-around assertions with grep:

$ echo 'BEGIN `helloworld` END' | grep -oP '(?<=BEGIN `).*(?=` END)'
helloworld

-o, --only-matching       show only the part of a line matching PATTERN
-P, --perl-regexp         PATTERN is a Perl regular expression

(?=pattern)
    is a positive look-ahead assertion
(?!pattern)
    is a negative look-ahead assertion
(?<=pattern)
    is a positive look-behind assertion
(?<!pattern)
    is a negative look-behind assertion
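The four assertions can be compared side by side with a small sketch. The sample string and variable names are made up for illustration, and it assumes a GNU grep built with PCRE support. Each pattern matches a single digit and differs only in the condition placed around it.

```shell
# Made-up sample: "1" sits between x and y, "2" sits between w and z.
sample='x1y w2z'

ahead=$(printf '%s\n' "$sample" | grep -oP '\d(?=y)')        # digit followed by y
not_ahead=$(printf '%s\n' "$sample" | grep -oP '\d(?!y)')    # digit not followed by y
behind=$(printf '%s\n' "$sample" | grep -oP '(?<=x)\d')      # digit preceded by x
not_behind=$(printf '%s\n' "$sample" | grep -oP '(?<!x)\d')  # digit not preceded by x

printf '%s %s %s %s\n' "$ahead" "$not_ahead" "$behind" "$not_behind"   # 1 2 1 2
```

In every case only the digit is printed, because the assertions check the neighbouring characters without including them in the match.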
