AWK – How to Use Regex with AWK for String Replacement

awk, regular expression, text processing

Suppose there is some text from a file:

(bookmarks
("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 23" "#2")
("Exercises 31" "#30")
("Notes and References 42" "#34"))
)

I want to add 11 to each number that is followed by a " on each line, if there is one, i.e.:

(bookmarks
("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 34" "#13")
("Exercises 42" "#41")
("Notes and References 53" "#45"))
)

Here is my attempt using GNU AWK and a regex:

awk -F'#' 'NF>1{gsub(/"(\d+)\""/, "\1+11\"")}'

i.e., I want to replace (\d+)\" with \1+11\", where \1 is the group matched by (\d+). But it doesn't work. How can I make it work?

If gawk is not the best solution, what else can be used?

Best Answer

Try this (it needs gawk, because gensub() is a gawk extension):

awk '{a=gensub(/.*#([0-9]+)(\").*/,"\\1","g",$0);if(a~/[0-9]+/) {gsub(/[0-9]+\"/,a+11"\"",$0);}print $0}' YourFile

Test with your example:

kent$  echo '(bookmarks
("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 2" "#2")
("Exercises 30" "#30")
("Notes and References 34" "#34"))
)
'|awk '{a=gensub(/.*#([0-9]+)(\").*/,"\\1","g",$0);if(a~/[0-9]+/) {gsub(/[0-9]+\"/,a+11"\"",$0);}print $0}'   
(bookmarks
("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 13" "#13")
("Exercises 41" "#41")
("Notes and References 45" "#45"))
)

Note that this command won't work if the two numbers (e.g. 1" and "#1") are different, or if one line contains more numbers matching this pattern (e.g. 23" ...32"..."#123").
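If the numbers on a line can differ, or if there can be more than two of them, one alternative is to walk the line with match() and rewrite each number-plus-quote occurrence separately. The following is only a sketch, not the command from this answer. It uses POSIX awk features only (match(), RSTART, RLENGTH, substr()), so it should not need gawk. The function name add_to_quoted_numbers and the variable inc are made up for illustration.

```shell
# add_to_quoted_numbers N
# Hypothetical helper: adds N to every number that is directly followed by a
# double quote. Reads stdin, writes stdout.
add_to_quoted_numbers() {
    awk -v inc="$1" '
    {
        rest = $0    # part of the line not processed yet
        out  = ""    # part of the line already rewritten
        # Consume the line left to right, one match at a time.
        while (match(rest, /[0-9]+"/)) {
            num  = substr(rest, RSTART, RLENGTH - 1)    # digits only, quote excluded
            out  = out substr(rest, 1, RSTART - 1) (num + inc) "\""
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

Usage would be something like add_to_quoted_numbers 11 < YourFile. Note that it changes every number directly followed by a ", so a title that happens to end in a number would be changed too.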


UPDATE

Since @Tim (OP) said that the numbers followed by " on the same line can differ, I changed my previous solution so that it works for your new example.

BTW, from the example it looks like a table-of-contents structure, so I don't see how the two numbers could be different. The first would be the printed page number, and the second (with #) would be the page index. Am I right?

Anyway, you know your requirements best. Here is the new solution, still with gawk (I split the command across lines to make it easier to read):

awk 'BEGIN{FS=OFS="\" \"#"}{if(NF<2){print;next;}
        a=gensub(/.* ([0-9]+)$/,"\\1","g",$1);
        b=gensub(/([0-9]+)\"/,"\\1","g",$2); 
        gsub(/[0-9]+$/,a+11,$1);
        gsub(/^[0-9]+/,b+11,$2);
        print $1,$2
}' yourFile

Test with your new example:

kent$  echo '(bookmarks
("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 23" "#2")
("Exercises 31" "#30")
("Notes and References 42" "#34"))
)
'|awk 'BEGIN{FS=OFS="\" \"#"}{if(NF<2){print;next;}
        a=gensub(/.* ([0-9]+)$/,"\\1","g",$1);
        b=gensub(/([0-9]+)\"/,"\\1","g",$2); 
        gsub(/[0-9]+$/,a+11,$1);
        gsub(/^[0-9]+/,b+11,$2);
        print $1,$2
}'                        
(bookmarks
("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 34" "#13")
("Exercises 42" "#41")
("Notes and References 53" "#45"))
)


EDIT2, based on @Tim's comments

(1) Does FS=OFS="\" \"#" mean the separator of field in both input and output is double quote, space, double quote and #? Why specify double quote twice?

You are right: it is the separator for both input and output. It defines the separator as:

" "#

The separator contains two double quotes because that makes it easier to isolate the two numbers you want (based on your example input): the first number ends up at the end of $1 and the second at the start of $2.
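To see what this separator does to a line, a small sketch like the following can help. It only sets FS (not OFS) and prints the field count and each field in brackets. The function name show_fields is made up for illustration, and it uses nothing gawk-specific.

```shell
# show_fields
# Hypothetical helper: splits each input line on the four-character
# separator  " "#  and prints NF followed by every field.
show_fields() {
    awk 'BEGIN { FS = "\" \"#" }
         {
             printf "NF=%d\n", NF
             for (i = 1; i <= NF; i++)
                 printf "$%d=[%s]\n", i, $i
         }'
}
```

For the line ("Exercises 31" "#30") this prints NF=2, then $1=[("Exercises 31] and $2=[30")]. That is why /[0-9]+$/ on $1 and /^[0-9]+/ on $2 are enough to reach the two numbers. A line such as (bookmarks does not contain the separator, so NF is 1 and the solution prints it unchanged.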

(2) In /.* ([0-9]+)$/, does $ mean the end of the string?

Exactly!
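A quick way to check the effect of the $ anchor is to match against a string that ends in digits and one that does not. This is just an illustrative sketch with a made-up function name (last_number). It uses match() instead of gensub() so that it runs in any POSIX awk.

```shell
# last_number
# Hypothetical helper: prints the digits at the very end of each input line,
# or "no match" when the line does not end in a digit.
last_number() {
    awk '{
        if (match($0, /[0-9]+$/))
            print substr($0, RSTART, RLENGTH)
        else
            print "no match"
    }'
}
```

Given ("1.1 Problem Statement and Basic Definitions 23 it prints 23, while for page 31 of 42 total it prints no match, because neither number sits at the end of the string.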

(3) In the third argument of gensub(), what is the difference between "g" and "G"?

There is no difference between G and g. Check this out:

gensub(regexp, replacement, how [, target])
    Search the target string target for matches of the regular expression regexp. 
    If "how" is a string beginning with ‘g’ or ‘G’ (short for “global”), then 
        replace all matches of regexp with replacement.

This is from http://www.gnu.org/s/gawk/manual/html_node/String-Functions.html. Read it for the detailed usage of gensub.
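A small sketch to try this out (gawk is required, since gensub() is a gawk extension). It also shows the other form of the third argument described in the same manual section: a number, which selects which match to replace. The function name gensub_how_demo is made up for illustration.

```shell
# gensub_how_demo
# Hypothetical helper: shows how the third argument ("how") of gensub()
# changes the result. The target defaults to $0.
gensub_how_demo() {
    gawk '{
        print gensub(/[0-9]+/, "N", "g")    # "g": replace every match
        print gensub(/[0-9]+/, "N", "G")    # "G": same as "g"
        print gensub(/[0-9]+/, "N", 2)      # number: replace only the 2nd match
        print $0                            # gensub() returns a new string, $0 is untouched
    }'
}
```

For the input 1 22 333 this prints N N N twice, then 1 N 333, then the original 1 22 333. Unlike sub() and gsub(), gensub() returns the result and leaves the target unchanged, which is why the solutions above assign its return value to a variable.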
