Greedy and lazy regular expressions (comprehension question)

command line, grep, regular expression

I am teaching myself regular expressions, and I got stuck on »greedy« vs. »lazy« repetition.

What I found out so far is that

»greedy« means that the RegExp looks for as many matches as possible, whereas
»lazy« means that the RegExp looks for as few matches as possible

Most articles I found either a) deal with using it in a programming language, while I am stuck here with grep and egrep, or b) use grep -P to activate Perl mode; but as I don't have any knowledge of Perl yet, this isn't very helpful for me.

My comprehension question: I came to this sledgehammer method:

lazy repetitions will look for the shortest possible match
if results are too long → tone down the repeater with ?
if results are still too long → look for another solution

This is what I was able to figure out through examples and experiments with HTML code, where I got some, but not overwhelming, results.

I would be grateful if somebody could tell me if and where I missed some important points with my summary.

Best Answer

It's not the shortest possible match, just a short match. Greedy mode tries to find the last possible match, lazy mode the first possible match. But the first possible match is not necessarily the shortest one.

Take the input string foobarbaz and the regexp o.*a (greedy) or o.*?a (lazy).

The shortest possible match in this input string would be oba.

However the RegExp looks for matches from left to right, so the o finds the first o in foobarbaz. And if the rest of the pattern produces a match, that's where it stays.

Following the first o, .* (greedy) eats obarbaz (the entire rest of the string) and then backtracks in order to match the rest of the pattern (a). Thus it finds the last a in baz and ends up matching oobarba.

Following the first o, .*? (lazy) doesn't eat the entire string; instead it looks for the first occurrence of the rest of the pattern. So first it sees the second o, which doesn't match a, then it sees b, which doesn't match a, then it sees a, which matches a, and because it's lazy that's where it stops. The result is ooba, not oba.

So while it's not THE shortest possible one, it's a shorter one than the greedy version.
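The walkthrough above can be checked on the command line. This is only a sketch: it assumes a grep that supports -o (print just the matched part, available in GNU and BSD grep) and, for the lazy case, a build with -P support. The last command is a common workaround for greps without lazy quantifiers: a negated bracket expression that cannot run past the first a.

```shell
# Input string taken from the answer above.
input='foobarbaz'

# Greedy: .* runs to the end of the line, then backtracks to the last "a".
greedy=$(printf '%s\n' "$input" | grep -o 'o.*a')

# Lazy: .*? needs Perl-compatible mode (-P), which not every grep build includes.
lazy=$(printf '%s\n' "$input" | grep -oP 'o.*?a')

# Stand-in for lazy without -P: "anything except a" stops at the first "a".
negated=$(printf '%s\n' "$input" | grep -o 'o[^a]*a')

printf 'greedy:  %s\nlazy:    %s\nnegated: %s\n' "$greedy" "$lazy" "$negated"
```

Both the lazy and the negated variant start at the first o, which is why neither of them yields the shorter oba.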

Related Solutions

Grep Performance – Search in Thousands of Files Efficiently

With find:

cd /the/dir
find . -type f -exec grep pattern {} +

(-type f restricts the search to regular files, which also excludes symlinks even if they point to regular files. If you want to search any type of file except directories (but beware that there are some types of files, like fifos or /dev/zero, that you generally don't want to read), replace -type f with the GNU-specific ! -xtype d; -xtype d matches files of type directory after symlink resolution.)
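A small throwaway experiment can illustrate the symlink behaviour described above. The directory layout and file names here are made up for the demonstration, and -H (always print the file name) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Build a scratch directory with two regular files and one symlink.
demo=$(mktemp -d)
printf 'needle here\n' > "$demo/a.txt"
printf 'nothing\n' > "$demo/b.txt"
ln -s a.txt "$demo/link-to-a"

# -type f passes only the regular files to grep, so the symlink is skipped
# and the matching line is reported once, not twice.
found=$(cd "$demo" && find . -type f -exec grep -H needle {} +)
printf '%s\n' "$found"

# Clean up the scratch directory.
rm -rf "$demo"
```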

With GNU grep:

grep -r pattern /the/dir

(but beware that unless you have a recent version of GNU grep, that will follow symlinks when descending into directories). Non-regular files won't be searched unless you add a -D read option. Recent versions of GNU grep will still not search inside symlinks though.

Very old versions of GNU find did not support the standard {} + syntax, but there you could use the non-standard:

cd /the/dir &&
  find . -type f -print0 | xargs -r0 grep pattern

Performance is likely to be I/O bound. That is, the time to do the search would be the time needed to read all that data from storage.

If the data is on a redundant disk array, reading several files at a time might improve performance (and could degrade it otherwise). If the performance is not I/O bound (because for instance all the data is in cache), and you have multiple CPUs, concurrent greps might help as well. You can do that with GNU xargs's -P option.

For instance, if the data is on a RAID1 array with 3 drives, or if the data is in cache and you have 3 CPUs with time to spare:

cd /the/dir &&
  find . -type f -print0 | xargs -n1000 -r0P3 grep pattern

(here using -n1000 to spawn a new grep every 1000 files, up to 3 running in parallel at a time).
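To see the batching in action without thousands of files, the same pipeline can be tried on a handful of scratch files with a smaller -n. This is a sketch only: the file names and the batch size of 2 are arbitrary, and it assumes an xargs that supports -r, -0 and -P (GNU xargs does). The output is sorted because the order in which parallel greps finish is not predictable.

```shell
# Create six small scratch files, each containing one matching line.
demo=$(mktemp -d)
for i in 1 2 3 4 5 6; do
  printf 'line %s\nneedle %s\n' "$i" "$i" > "$demo/file$i.txt"
done

# Two files per grep invocation, up to three greps running at once.
# -h drops the file name prefix so the sorted output is easy to compare.
matches=$(cd "$demo" && find . -type f -print0 | xargs -r0 -n2 -P3 grep -h needle | sort)
printf '%s\n' "$matches"

# Count the matching lines collected from all grep processes.
count=$(printf '%s\n' "$matches" | grep -c needle)

# Clean up the scratch directory.
rm -rf "$demo"
```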

However note that if the output of grep is redirected, you'll end up with badly interleaved output from the 3 grep processes, in which case you may want to run it as:

find . -type f -print0 | stdbuf -oL xargs -n1000 -r0P3 grep pattern

(on a recent GNU or FreeBSD system) or use the --line-buffered option of GNU grep.

If pattern is a fixed string, adding the -F option could improve matters.
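Besides any speed difference, -F also changes what matches, which is worth checking before switching. A minimal illustration with made-up sample lines:

```shell
# Two sample lines: one contains a literal dot, the other does not.
sample=$(printf 'a.b\naxb\n')

# As a regular expression, the dot matches any character, so both lines match.
as_regex=$(printf '%s\n' "$sample" | grep 'a.b')

# With -F the pattern is a plain string, so only the literal "a.b" matches.
as_fixed=$(printf '%s\n' "$sample" | grep -F 'a.b')

printf 'regex:\n%s\nfixed:\n%s\n' "$as_regex" "$as_fixed"
```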

If it's not multi-byte character data, or if for the matching of that pattern, it doesn't matter whether the data is multi-byte character or not, then:

cd /the/dir &&
  LC_ALL=C grep -r pattern .

could improve performance significantly.

If you end up doing such searches often, then you may want to index your data using one of the many search engines out there.

Shell Grep Regular Expression – Escaping * with Regular Expressions and Grep

So try:

egrep "^\*\*" YOUR_FILE

Don't forget to use double quotes.

Note: Use egrep instead of grep.
If you want to use grep, use grep -E.
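The escaping can be tried on a few made-up sample lines. This is a sketch rather than a definitive recipe: on the grep implementations I am aware of, the backslash-escaped asterisk is also literal in basic mode (without -E), and a bracket expression such as [*] is another way to avoid the backslashes. Single quotes are used here so the shell passes the pattern through untouched.

```shell
# Sample lines: only the first one starts with two asterisks.
sample=$(printf '** heading\nplain * text\n*single\n')

# Extended syntax, as suggested above (grep -E is the equivalent of egrep).
ere=$(printf '%s\n' "$sample" | grep -E '^\*\*')

# Basic syntax: an escaped asterisk is a literal asterisk here as well.
bre=$(printf '%s\n' "$sample" | grep '^\*\*')

# Bracket expressions make the asterisk literal without any backslash.
bracket=$(printf '%s\n' "$sample" | grep '^[*][*]')

printf '%s\n%s\n%s\n' "$ere" "$bre" "$bracket"
```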
