Greedy and lazy regular expressions (comprehension question)

command linegrepregular expression

I am teaching myself regular expressions, and I got stuck at »greedy« vs. »lazy« repeatings.

What I found out so far is that

  • »greedy« means that the RegExp looks for as many matches as possible, where
  • »lazy« means that the RegExp looks for as little matches as possible

Most articles I found deal with a) using it in a programming language, while I am stuck here with grep and egrep or b) use grep -P to activate Perl Mode; but as I don't have any knowledge about Perl yet this isn't very helpful for me.

My comprehension question: I came to this sledgehammer method:

  • lazy repetitions will look for the shortest possible match
  • if results are too long → tone down the repeater with ?
  • if results are still too long → look for another solution

This is what I was able to figure out through examples and experiments with HTML code where I got to some but not overwhelming results.

I would be grateful if somebody could tell me if and where I missed some important points with my summary.

Best Answer

It's not the shortest possible match, just a short match. Greedy mode tries to find the last possible match, lazy mode the first possible match. But the first possible match is not necessarily the shortest one.

Take the input string foobarbaz and the regexp o.*a (greedy) or o.*?a (lazy).

The shortest possible match in this input string would be oba.

However the RegExp looks for matches from left to right, so the o finds the first o in foobarbaz. And if the rest of the pattern produces a match, that's where it stays.

Following the first o, .* (greedy) eats obarbaz (the entire string) and then backtracks in order to match the rest of the pattern (a). Thus it finds the last a in baz and ends up matching oobarba.

Following the first o, .*? (lazy) doesn't eat the entire string, instead it looks for the first occurrence of the rest of the pattern. So first it sees the second o, which doesn't match a, then it sees b, which doesn't match a, then it sees a, which matches a, and because it's lazy that's where it stops. (and the result is ooba, but not oba)

So while it's not THE shortest possible one, it's a shorter one than the greedy version.

Related Question