AWK Regular Expressions – Reduce Greediness

awk, regular expression

I want to do non-greedy pattern (regular expression) matching in awk.
Here is an example:

echo "@article{gjn, Author =   {Grzegorz J. Nalepa}, " | awk '{ sub(/@.*,/,""); print }'

Is it possible to write a regular expression that selects the shorter string?

@article{gjn,

instead of this long string?:

@article{gjn, Author =   {Grzegorz J. Nalepa},

I want to get this result:

 Author =   {Grzegorz J. Nalepa},

I have another example:

echo ",article{gjn, Author =   {Grzegorz J. Nalepa}, " | awk '{ sub(/,[^,]*,/,""); print }'
      ↑                                                              ↑^^^^^

Note that I changed the @ characters to comma (,) characters
in the first position of both the input string and the regular expression
(and also changed .* to [^,]*).
Is it possible to write a regular expression that selects the shorter string?

, Author =   {Grzegorz J. Nalepa},

instead of the longer string?:

,article{gjn, Author =   {Grzegorz J. Nalepa},

I want to get this result:

,article{gjn

Best Answer

If you want to select @ and up to the first , after that, you need to specify it as @[^,]*,

That is @ followed by any number (*) of non-commas ([^,]) followed by a comma (,).
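Applied to the example from the question, that could look like the sketch below. The function name strip_key is only an illustrative wrapper, not something from the question.

```shell
# Remove from "@" up to and including the first comma that follows it.
# strip_key is a hypothetical helper name used only for this example.
strip_key() {
  awk '{ sub(/@[^,]*,/, ""); print }'
}

echo "@article{gjn, Author =   {Grzegorz J. Nalepa}, " | strip_key
```

This should print " Author =   {Grzegorz J. Nalepa}, " (with the leading and trailing space preserved), which is the result asked for.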

That approach works as the equivalent of the non-greedy @.*?, (a syntax that awk's extended regular expressions do not support), but not for things like @.*?string, where what follows is more than a single character. Negating a single character is easy, but negating strings in regexps is a lot more difficult.

A different approach is to pre-process your input to replace or prepend the string with a character that otherwise doesn't occur in your input:

gsub(/string/, "\1&") # pre-process
gsub(/@[^\1]*\1string/, "")
gsub(/\1/, "") # revert the pre-processing
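As a minimal sketch of those three steps in one runnable command, assuming the input never contains the \1 character (the function name and the sample input are made up for illustration):

```shell
# Hypothetical helper: delete from "@" through the FIRST "string" after it.
# Assumes the byte \1 (octal 001) never occurs in the input.
strip_to_first_string() {
  awk '{
    gsub(/string/, "\1&")          # mark every "string" with a leading \1
    gsub(/@[^\1]*\1string/, "")    # [^\1]* cannot run past the first marker
    gsub(/\1/, "")                 # remove the markers that are left
    print
  }'
}

echo 'keep @foo string bar string baz' | strip_to_first_string
```

For that sample line this should print "keep  bar string baz", whereas the greedy sub(/@.*string/, "") would leave only "keep  baz".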

If you can't guarantee that the input won't contain your replacement character (\1 above), one approach is to use an escaping mechanism:

gsub(/\1/, "\1\3") # use \1 as the escape character and escape itself as \1\3
                   # in case it's present in the input
gsub(/\2/, "\1\4") # use \2 as our marker character and escape it
                   # as \1\4 in case it's present in the input
gsub(/string/, "\2&") # mark the "string" occurrences

gsub(/@[^\2]*\2string/, "")

# then roll back the marking and escaping
gsub(/\2/, "")
gsub(/\1\4/, "\2")
gsub(/\1\3/, "\1")
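Put together as one command, the escaping variant might be sketched as follows. The function name and the sample input (which deliberately contains a literal \1 byte) are assumptions for illustration only.

```shell
# Hypothetical helper: same idea as above, but safe when the input
# already contains the \1 or \2 bytes.
strip_to_first_string_safe() {
  awk '{
    gsub(/\1/, "\1\3")             # escape any literal \1 as \1\3
    gsub(/\2/, "\1\4")             # escape any literal \2 as \1\4
    gsub(/string/, "\2&")          # mark every "string" with a leading \2
    gsub(/@[^\2]*\2string/, "")    # delete up to the first marked "string"
    gsub(/\2/, "")                 # remove the markers that are left
    gsub(/\1\4/, "\2")             # restore literal \2
    gsub(/\1\3/, "\1")             # restore literal \1
    print
  }'
}

# The input carries a literal \001 byte, which should survive the round trip.
printf 'a\001b @x string y string\n' | strip_to_first_string_safe
```

The order of the roll-back matters: the marker is removed first, then \1\4 is turned back into \2, and only then is \1\3 turned back into \1.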

That works for fixed strings, but not for arbitrary regexps such as the equivalent of @.*?foo.bar.

Related Solutions

Where to get the new string after running `sub` in awk

From the GNU awk manual 9.1.3 String-Manipulation Functions:

... the third argument to sub() must be a variable, field, or array element. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub() still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it. Such versions of awk accept expressions like the following:

sub(/USA/, "United States", "the USA and Canada")

For historical compatibility, gawk accepts such erroneous code. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.

So, the answer is to use a variable:

awk 'BEGIN{t = "banana"; sub(/ana/,"anda",t); print t}'
bandana
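To make the difference between the return value and the modified variable explicit, here is a small variation on the same command. The variable names n and result are just illustrative choices.

```shell
# sub() returns the number of substitutions (0 or 1);
# the new string ends up in the variable passed as the third argument.
result=$(awk 'BEGIN {
  t = "banana"
  n = sub(/ana/, "anda", t)
  print n, t
}')
echo "$result"
```

This should print "1 bandana": 1 is what sub() returned, and bandana is the updated value of t.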

Bash – How does storing the regular expression in a shell variable avoid problems with quoting characters that are special to the shell

[[ ... ]] tokenisation clashes with regular expressions (more on that in my answer to your follow-up question) and \ is overloaded as a shell quoting operator and a regexp operator (with some interference between the two in bash), and even when there's no apparent reason for a clash, the behaviour can be surprising. Rules can be confusing.

Who can tell what these will do without trying them (on all possible input) with any given version of bash?

[[ $a = a|b ]]
[[ $a =~ a|b ]]
[[ $a =~ a&b ]]
[[ $a =~ (a|b) ]]
[[ $a =~ ([)}]*) ]]
[[ $a =~ [/\(] ]]
[[ $a =~ \s+ ]]
[[ $a =~ ( ) ]]
[[ $a =~ [ ] ]]
[[ $a =~ ([ ]) ]]

You can't quote the regexps either: since bash 3.2 (unless bash 3.1 compatibility has been enabled), quoting a regexp removes the special meaning of its RE operators. For instance,

[[ $a =~ 'a|b' ]]

matches only if $a contains a literal a|b.

Storing the regexp in a variable avoids all those problems and also makes the code compatible with ksh93 and zsh (provided you limit yourself to POSIX EREs):

regexp='a|b'
[[ $a =~ $regexp ]] # $regexp should *not* be quoted.

There's no ambiguity in the parsing/tokenising of that shell command, and the regexp that is used is the one stored in the variable without any transformation.
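A small side-by-side sketch of the two forms, assuming bash 3.2 or later with default compatibility settings (the helper function names are made up for this comparison):

```shell
# Requires bash ([[ ... ]] with =~ is not available in plain POSIX sh).
regexp='a|b'

# Hypothetical helpers, only for comparing the two forms.
match_var()    { [[ $1 =~ $regexp ]]; }  # regexp from the variable, unquoted
match_quoted() { [[ $1 =~ 'a|b' ]]; }    # quoted: the literal string a|b

for a in a 'a|b' c; do
  if match_var "$a";    then echo "variable: '$a' matches"; fi
  if match_quoted "$a"; then echo "quoted:   '$a' matches"; fi
done
```

Under those assumptions, the variable form should match both a and a|b, while the quoted form should match only the value that contains the literal text a|b. Neither form matches c.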
