How or why is using `.*?` better than `.*`?

regular expression, wildcards

I answered a question on SuperUser about the kind of regular expressions used when grepping output.

The answer I gave was this:

 tail -f log | grep "some_string.*some_string"

Then, in three comments on my answer, @Bob wrote this:

.* is greedy and might capture more than you want. .*? is usually better.

Then this,

the ? is a modifier on *, making it lazy instead of the greedy default. Assuming PCRE.

I googled PCRE, but couldn't work out what its significance is for my answer.

And finally this:

I should also point out that this is regex (grep doing POSIX regex by default), not a shell glob.

I only know what a regex is and the very basics of using one with the grep command, so I couldn't follow any of those three comments. I have these questions:

What are the differences in usage between .* and .*?
Which is better and under what circumstance? Please provide examples.

Also, it would be helpful if anyone could explain the comments.

UPDATE: In answer to the question "How are regexes different from shell globs?", @Kusalananda provided this link in his comment.

NOTE: If needed, please read my answer to that question for context before answering.

Best Answer

Ashok already pointed out the difference between .* and .*?, so I'll just provide some additional information.

grep (assuming the GNU version) supports 4 ways to match strings:

Fixed strings, with the -F option
Basic regular expressions (BRE), default
Extended regular expressions (ERE), with the -E option
Perl-compatible regular expressions (PCRE), with the -P option in GNU grep

grep uses BRE by default.

BRE and ERE are documented in the Regular Expressions chapter of POSIX, and PCRE is documented on its official website. Please note that features and syntax may vary between implementations.
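To make the four modes concrete, here is a small sketch that runs one input line through each of them. The sample text and patterns are my own illustration, not from the question, and the -P line assumes GNU grep built with PCRE support.

```shell
# Illustrative sample line; single quotes keep "$5" literal.
line='a+b costs $5 (approx.)'

# -F: every character is literal, so "+" is just a plus sign.
fixed=$(printf '%s\n' "$line" | grep -F -o 'a+b')

# BRE (default): unescaped "(" and ")" are literal characters.
bre=$(printf '%s\n' "$line" | grep -o '(approx\.)')

# ERE (-E): "+" is a quantifier meaning "one or more".
ere=$(printf '%s\n' "$line" | grep -E -o '[a-z]+\.')

# PCRE (-P, GNU grep only): adds shorthands such as \d for a digit.
pcre=$(printf '%s\n' "$line" | grep -P -o '\$\d')

printf '%s\n' "$fixed" "$bre" "$ere" "$pcre"
```

The same character can therefore mean different things depending on the option: "+" is literal under -F and in BRE, but a quantifier under -E and -P.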

It's worth noting that neither BRE nor ERE supports laziness. POSIX leaves a quantifier directly followed by another, such as the *? in an ERE, undefined:

The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces undefined results.

So if you want to use that feature, you'll need to use PCRE instead:

# PCRE greedy
$ grep -P -o 'c.*s' <<< 'can cats eat plants?'
can cats eat plants

# PCRE lazy
$ grep -P -o 'c.*?s' <<< 'can cats eat plants?'
can cats

Could you please explain a little about .* vs .*? ?

.* is used to match the "longest"¹ pattern possible.
.*? is used to match the "shortest"¹ pattern possible.

In my experience, the second behavior is usually the one you want.

For example, let's say we have the following string and we only want to match the HTML tags², not the content between them:

<title>My webpage title</title>

Now compare .* vs .*?:

# Greedy
$ grep -P -o '<.*>' <<< '<title>My webpage title</title>'
<title>My webpage title</title>

# Lazy
$ grep -P -o '<.*?>' <<< '<title>My webpage title</title>'
<title>
</title>
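If -P is not available, a negated bracket expression often gives the same result as the lazy version in this particular case. The following is only a sketch of that workaround using the default BRE mode; it relies on the stop character (">" here) being a single character.

```shell
html='<title>My webpage title</title>'

# Default BRE, no -P needed: [^>]* cannot run past the first ">",
# so each tag is matched on its own even though * is still greedy.
tags=$(printf '%s\n' "$html" | grep -o '<[^>]*>')

printf '%s\n' "$tags"
```

This prints the two tags on separate lines, like the lazy PCRE example, and it also works in tools that only implement BRE or ERE.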

¹ The meaning of "longest" and "shortest" in a regex context is a bit tricky, as Kusalananda pointed out. Refer to official documentation for more information.

² It's not recommended to parse html with regex. This is just an example for educational purposes, don't use it in production.

Related Solutions

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
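As a rough sketch of the line ending conversion mentioned above, the pipeline below strips carriage returns with tr after converting. The temporary file and its contents are made up for the demonstration; with a real file you would only need the middle pipeline.

```shell
# Build a small UTF-16LE sample with CRLF line endings (hypothetical contents).
sample=$(mktemp)
printf 'first line\r\nsecond pattern here\r\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Convert to UTF-8, drop the carriage returns, then search with ordinary grep.
found=$(iconv -f UTF-16LE -t UTF-8 "$sample" | tr -d '\r' | grep 'pattern')

printf '%s\n' "$found"

rm -f "$sample"
```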

Regular Expression – Definition and Explanation

As lk- said, the -name option of find will treat the argument as a glob, not a regular expression.

Whether a string is interpreted as a glob or a regex or just a plain string depends on what is being used to do the interpreting. It's a matter of context. The string in your example, [Rr]eading[Tt]est[Dd]ata, can be evaluated in a number of different ways, but what it is depends on how you are using it. Use it as a glob, it's a glob. Use it as a regex, it's a regex. In the case of the question where this originated, the OP described the string as a regex. Therefore we can assume he was planning to interpret it as a regex.
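Here is a minimal sketch of that point: the same string is handed to a glob matcher (a shell case statement) and to a regex matcher (grep). The file name is a made-up example.

```shell
pattern='[Rr]eading[Tt]est[Dd]ata'
name='readingTestData.txt'   # hypothetical file name

# Interpreted as a glob: the pattern has to cover the whole string.
case $name in
  $pattern) as_glob='match' ;;
  *)        as_glob='no match' ;;
esac

# Interpreted as a regex: an unanchored pattern may match any substring.
if printf '%s\n' "$name" | grep -q "$pattern"; then
  as_regex='match'
else
  as_regex='no match'
fi

printf 'glob: %s\nregex: %s\n' "$as_glob" "$as_regex"
```

The glob fails because nothing in the pattern accounts for the ".txt" suffix, while grep is satisfied with a match anywhere in the line. The characters are identical; only the interpreter changed.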

A single character can also be a regex, absolutely. It can also be a string, and it can also be a glob. It could be interpreted as a byte or a tinyint, if you like. It all depends on context.

There are a number of specifications for regular expressions in various forms. BRE and ERE are well documented. PCRE adds scads of functionality. Many regex interpreters will implement, for example, "all of ERE and some of PCRE", or ERE minus some feature. If you go by the formal specifications, many tools claim regex support that turns out to be incorrect or incomplete. Knowing the details lets you adapt your solutions to the functionality available in whatever tool is evaluating your regex.

So ... if you're looking for definitions that "exclude" globs, you're looking at this from the wrong perspective. What it is is determined by how you use it.
