Where to get the new string after running `sub` in awk

awk

From The Awk Programming Language

The function sub(r, s, t) first finds the leftmost longest
substring matched by the regular expression r in the target
string t; it then replaces the substring by the substitution
string s.

The function sub(r,s) is a synonym for sub(r,s,$0).

In sub(/ana/, "anda", "banana"), for example, banana is
replaced with bandana.

After running sub(r, s, t), how can I get the new string?
For example, in sub(/ana/, "anda", "banana"), how can I get the new string bandana?

The sub function returns the number of substitutions made.

Is the return value of sub always either 0 or 1? Is it correct that it can't be more than one, because sub only finds the first match and replaces it?

Thanks.

Best Answer

From the GNU awk manual 9.1.3 String-Manipulation Functions:

... the third argument to sub() must be a variable, field, or array element. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub() still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it. Such versions of awk accept expressions like the following:

sub(/USA/, "United States", "the USA and Canada")

For historical compatibility, gawk accepts such erroneous code. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.

So, the answer is to use a variable:

awk 'BEGIN{t = "banana"; sub(/ana/,"anda",t); print t}'
bandana
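
As for the return value: sub() replaces at most one match, so it can only return 0 or 1, while gsub() returns the number of replacements it made. A minimal sketch of that difference (the variable names t, u, n, m, and k are only illustrative):

```shell
out=$(awk 'BEGIN {
    t = "banana"
    n = sub(/ana/, "anda", t)   # first match replaced, result stored in t
    m = sub(/xyz/, "q", t)      # no match, so t is left unchanged
    u = "banana"
    k = gsub(/a/, "o", u)       # gsub replaces every match and counts them
    print n, t
    print m, t
    print k, u
}')
printf '%s\n' "$out"
```

This should print 1 bandana, 0 bandana, and 3 bonono on three lines.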

Related Solutions

AWK Regular Expressions – Reduce Greediness

If you want to select @ and up to the first , after that, you need to specify it as @[^,]*,

That is @ followed by any number (*) of non-commas ([^,]) followed by a comma (,).

That approach works as the equivalent of @.*?,, but not for patterns like @.*?string, where the text that follows is more than a single character. Negating a character is easy, but negating strings in regexps is a lot more difficult.
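
To illustrate the single-character case, here is a small sketch comparing the negated bracket expression with the greedy form. The sample string and the variable names lazy and greedy are made up for the example:

```shell
out=$(awk 'BEGIN {
    s = "user@example.com,rest,more"
    lazy = s;   sub(/@[^,]*,/, "", lazy)    # stops at the first comma after @
    greedy = s; sub(/@.*,/, "", greedy)     # runs on to the last comma
    print lazy
    print greedy
}')
printf '%s\n' "$out"
```

The first line printed should be userrest,more and the second usermore.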

A different approach is to pre-process your input to replace or prepend the string with a character that otherwise doesn't occur in your input:

gsub(/string/, "\1&") # pre-process
gsub(/@[^\1]*\1string/, "")
gsub(/\1/, "") # revert the pre-processing
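
A self-contained sketch of the same three steps, run on a made-up sample string. It assumes the input never contains the \001 character, and it builds the regexps from a variable m (a name chosen here for illustration) rather than writing \1 inside a regexp literal:

```shell
out=$(awk 'BEGIN {
    m = "\001"                              # marker, assumed absent from the input
    s = "id@foo-string-bar-string-end"
    gsub(/string/, m "&", s)                # put the marker before each "string"
    gsub("@[^" m "]*" m "string", "", s)    # @ up to the first marked "string"
    gsub(m, "", s)                          # remove the remaining markers
    print s
}')
printf '%s\n' "$out"
```

This should print id-bar-string-end, whereas a greedy gsub(/@.*string/, "") would leave id-end.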

If you can't guarantee that the input won't contain your replacement character (\1 above), one approach is to use an escaping mechanism:

gsub(/\1/, "\1\3") # use \1 as the escape character and escape itself as \1\3
                   # in case it's present in the input
gsub(/\2/, "\1\4") # use \2 as our marker character and escape it
                   # as \1\4 in case it's present in the input
gsub(/string/, "\2&") # mark the "string" occurrences

gsub(/@[^\2]*\2string/, "")

# then roll back the marking and escaping
gsub(/\2/, "")
gsub(/\1\4/, "\2")
gsub(/\1\3/, "\1")

That works for fixed strings but not for arbitrary regexps, such as the equivalent of @.*?foo.bar.
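
The escaping steps could be wrapped in a function along these lines. This is only a sketch: strip_lazy is a hypothetical name, e and m are local variables holding the escape and marker characters, and the test input deliberately contains a \002 character to show that it survives the round trip:

```shell
out=$(awk '
function strip_lazy(s,    e, m) {           # hypothetical helper; e and m are locals
    e = "\001"; m = "\002"
    gsub(e, e "\003", s)                    # escape the escape character itself
    gsub(m, e "\004", s)                    # escape any marker already in the input
    gsub(/string/, m "&", s)                # mark the "string" occurrences
    gsub("@[^" m "]*" m "string", "", s)    # delete @ up to the first marked "string"
    gsub(m, "", s)                          # roll back the marking
    gsub(e "\004", m, s)                    # and then the escaping
    gsub(e "\003", e, s)
    return s
}
BEGIN { print strip_lazy("a\002b@x string y string") }
')
printf '%s\n' "$out"
```

The result should be the input with "@x string" removed and the original \002 character still in place.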

Gawk: Passing arrays to functions

Function parameters are local to the function.

awk '
    function foo(x,y) {y=x*x; print "y in function: "y} 
    BEGIN {foo(2); print "y out of function: " y}
'

y in function: 4
y out of function:

If you pass fewer values to a function than there are parameters, the extra parameters are just empty. You might sometimes see functions defined like

function foo(a, b, c            d, e, f) {...

where the parameters after the whitespace are local variables and are not intended to take a value at invocation.
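
A short sketch of that convention, using a made-up function sum_to whose extra parameters i and total serve as local variables:

```shell
out=$(awk '
function sum_to(n,    i, total) {       # i and total are locals by convention
    for (i = 1; i <= n; i++) total += i
    return total
}
BEGIN {
    i = 99                              # global i, untouched by the function
    print sum_to(4), i
}')
printf '%s\n' "$out"
```

This should print 10 99: the loop counter inside the function does not disturb the global i.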

No reason why this can't work for local arrays:

awk '
    function bar(x) {
        split("hello world", x)
        print "in: " x[1]
    }
    BEGIN {
        x[1]="world"
        bar()
        print "out: " x[1]}
'

in: hello
out: world
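
For contrast, arrays that are actually passed as arguments are passed by reference, so the function changes the caller's array. A sketch of the same program with the array passed explicitly:

```shell
out=$(awk '
function bar(x) {
    split("hello world", x)     # x is now the array passed by the caller
    print "in: " x[1]
}
BEGIN {
    x[1] = "world"
    bar(x)                      # pass the global array explicitly
    print "out: " x[1]
}')
printf '%s\n' "$out"
```

Here both lines should show hello, because split() overwrote the global array.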
