Shell – Perl regex get word between a pattern

perl, regular expression, shell-script

I have a working perl regex using grep. I am trying to understand how it works.

Here is the command:

grep -oP '(?<=location>)[^<]+' testFile1.xml

Here are the contents of testFile1.xml

<con:location>C:/test/file1.txt</con:location></con:dataFile></con:dataFiles></con:groupFile>

And this is the result

C:/test/file1.txt

I am trying to understand the regex, i.e. this part (?<=location>)[^<]+

Best Answer

(?<=...) is the PCRE look-behind operator. By itself it consumes no characters; it acts as a condition: the text immediately to the left of the current position must match ....

(?<=X)Y matches Y provided that what is immediately to its left matches X. In blahYfooXYbar, it matches only the second Y; the X is not part of the match. The (?<=X) itself matches the zero-width (imaginary) spot just before that Y, as illustrated here:

$ echo X-RAY THE FOX | perl -lpe 's/(?<=X)/<there>/g'
X<there>-RAY THE FOX<there>

Because grep -o prints only the matched portion, the look-behind is a way to make it print just what follows location>. Here that is whatever matches [^<]+: one or more (+) characters other than < ([^<]), that is, everything up to (but not including) the next < character or the end of the line, provided it is not empty.

Another approach is to use \K (in newer versions of PCRE) to reset the start of the matched portion:

grep -Po 'location>\K[^<]+'

Note that -P and -o are GNU extensions. With recent versions (8.11 or later) of pcregrep (another grep implementation that uses PCRE), you can also do:

pcregrep -o1 'location>([^<]+)'

(-o1 prints what is captured by the first, and here only, (...) group.)
