How or why is using `.*?` better than `.*`?

regular expression, wildcards

I answered this question on Super User, which was about the kind of regular expression to use when grepping some output.

The answer I gave was this:

 tail -f log | grep "some_string.*some_string"

Then, in three comments on my answer, @Bob wrote this:

.* is greedy and might capture more than you want. .*? is usually better.

Then this,

the ? is a modifier on *, making it lazy instead of the greedy default. Assuming PCRE.

I googled PCRE, but couldn't work out what its significance is for my answer.

and finally this,

I should also point out that this is regex (grep doing POSIX regex by default), not a shell glob.

I only know what a regex is and how to use it in very basic ways with the grep command, so I couldn't follow any of those three comments. I have these questions in mind:

  • What are the differences in usage of `.*` vs. `.*?`?
  • Which is better, and under what circumstances? Please provide examples.

It would also be helpful if anyone could explain those comments.


UPDATE: In answer to the question "How are regexes different from shell globs?", @Kusalananda provided this link in his comment.

NOTE: If needed, please read my answer to this question for context before answering.

Best Answer

Ashok already pointed out the difference between .* and .*?, so I'll just provide some additional information.

grep (assuming the GNU version) supports 4 ways to match strings:

  • Fixed strings, with the -F option
  • Basic regular expressions (BRE), default
  • Extended regular expressions (ERE), with the -E option
  • Perl-compatible regular expressions (PCRE), with the -P option in GNU grep

grep uses BRE by default.

BRE and ERE are documented in the Regular Expressions chapter of POSIX, and PCRE is documented on its official website. Please note that features and syntax may vary between implementations.
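To make the four modes concrete, here is a small sketch that runs the same pattern through each one. The sample line 'aa+b' and the pattern 'a+' are made up for illustration, and the output comments assume GNU grep built with PCRE support:

```shell
# Hypothetical sample input; the pattern 'a+' is identical in every call.
line='aa+b'

printf '%s\n' "$line" | grep -o -F 'a+'   # fixed string: literal "a+", prints a+
printf '%s\n' "$line" | grep -o 'a+'      # BRE: "+" is an ordinary character, prints a+
printf '%s\n' "$line" | grep -o -E 'a+'   # ERE: one or more "a", prints aa
printf '%s\n' "$line" | grep -o -P 'a+'   # PCRE: one or more "a", prints aa
```

The same characters mean different things depending on the mode, which is why it matters which flavor a pattern was written for.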

It's worth saying that neither BRE nor ERE supports laziness. In the words of the POSIX specification:

The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces undefined results.

So if you want to use that feature, you'll need to use PCRE instead:

# PCRE greedy
$ grep -P -o 'c.*s' <<< 'can cats eat plants?'
can cats eat plants

# PCRE lazy
$ grep -P -o 'c.*?s' <<< 'can cats eat plants?'
can cats
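If -P is not available, a common workaround (not part of the original comments, so treat it as a sketch) is to replace the lazy `.*?` with a negated bracket expression that refuses to cross the closing character. This works in plain BRE and ERE:

```shell
# "c", then any run of characters that are not "s", then "s".
# No -P needed: bracket expressions are part of BRE and ERE.
printf '%s\n' 'can cats eat plants?' | grep -o 'c[^s]*s'
# prints: can cats
```

This only stands in for the lazy form when the closing delimiter is a single character; for a multi-character delimiter the two are not interchangeable.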

Could you please explain a little about .* vs .*? ?

  • .* is used to match the "longest" [1] pattern possible.

  • .*? is used to match the "shortest" [1] pattern possible.

In my experience, the most wanted behavior is usually the second one.

For example, let's say we have the following string and we only want to match the HTML tags [2], not the content between them:

<title>My webpage title</title>

Now compare .* vs .*?:

# Greedy
$ grep -P -o '<.*>' <<< '<title>My webpage title</title>'
<title>My webpage title</title>

# Lazy
$ grep -P -o '<.*?>' <<< '<title>My webpage title</title>'
<title>
</title>

1. The meaning of "longest" and "shortest" in a regex context is a bit tricky, as Kusalananda pointed out. Refer to official documentation for more information.
2. It's not recommended to parse HTML with regex. This is just an example for educational purposes; don't use it in production.
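As a small illustration of why "shortest" is tricky: the engine still starts at the leftmost position where a match is possible, and laziness only controls how far it extends from there. The pattern below is made up for this sketch, and the output assumes GNU grep with PCRE support:

```shell
# The shortest substring matching c...ts is "cats", but the match must start
# at the leftmost "c" (the one in "can"), so the lazy quantifier extends
# from there until "ts" first becomes possible.
printf '%s\n' 'can cats eat plants?' | grep -o -P 'c.*?ts'
# prints: can cats
```

So `.*?` gives the shortest match from a given starting point, not the shortest matching substring anywhere in the line.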