AWK – How to Use Regex with AWK for String Replacement

awk, regular expression, text processing

Suppose there is some text from a file:

("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 23" "#2")
("Exercises 31" "#30")
("Notes and References 42" "#34"))

I want to add 11 to each number that is followed by a " on each line, if there is one, i.e.

("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 34" "#13")
("Exercises 42" "#41")
("Notes and References 53" "#45"))

Here is my solution by using GNU AWK and regex:

awk -F'#' 'NF>1{gsub(/"(\d+)\""/, "\1+11\"")}'

That is, I want to replace (\d+)\" with \1+11\", where \1 is the group matched by (\d+). But it doesn't work. How can I make it work?

If gawk is not the best solution, what else can be used?

Best Answer

Try this (gawk is needed).

awk '{a=gensub(/.*#([0-9]+)(\").*/,"\\1","g",$0);if(a~/[0-9]+/) {gsub(/[0-9]+\"/,a+11"\"",$0);}print $0}' YourFile

Test with your example:

kent$  echo '(bookmarks
("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 2" "#2")
("Exercises 30" "#30")
("Notes and References 34" "#34"))
'|awk '{a=gensub(/.*#([0-9]+)(\").*/,"\\1","g",$0);if(a~/[0-9]+/) {gsub(/[0-9]+\"/,a+11"\"",$0);}print $0}'
(bookmarks
("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 13" "#13")
("Exercises 41" "#41")
("Notes and References 45" "#45"))

Note that this command won't work if the two numbers on a line (e.g. 1" and "#1") are different, or if the line contains more numbers matching this pattern (e.g. 23" ...32"..."#123").
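If the numbers on a line can differ, one way around that limitation is to drop the backreference idea and walk each line with match(), bumping every number that is directly followed by a double quote. This is only a minimal sketch in POSIX awk (no gensub needed), not the command above; the names bump, rest, out and num are made up for illustration.

```shell
# Sketch: add 11 to every run of digits that is directly followed by ".
# Uses only POSIX awk features (match, RSTART, RLENGTH, substr).
bump() {
    awk '{
        rest = $0; out = ""
        # match() sets RSTART/RLENGTH for the leftmost digits-then-quote
        while (match(rest, /[0-9]+"/)) {
            num  = substr(rest, RSTART, RLENGTH - 1)   # the digits, without the quote
            out  = out substr(rest, 1, RSTART - 1) (num + 11) "\""
            rest = substr(rest, RSTART + RLENGTH)      # continue after the quote
        }
        print out rest
    }' "$@"
}

printf '%s\n' '("Exercises 31" "#30")' | bump
```

For that sample line it prints ("Exercises 42" "#41"). Numbers not followed by a quote, such as the 1 in "Chapter 1", are left alone because the pattern requires the quote.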


Since @Tim (the OP) said the numbers followed by " on the same line could differ, I changed my previous solution so that it works for your new example.

BTW, from the example it looks like a table-of-contents structure, so I don't see how the two numbers could be different. The first would be the printed page number, and the second (with #) would be the page index. Am I right?

Anyway, you know your requirements best. Here is the new solution, still with gawk (I split the command across lines to make it easier to read):

awk 'BEGIN{FS=OFS="\" \"#"}{if(NF<2){print;next;}
        a=gensub(/.* ([0-9]+)$/,"\\1","g",$1);
        b=gensub(/([0-9]+)\"/,"\\1","g",$2);
        gsub(/[0-9]+$/,a+11,$1);
        gsub(/^[0-9]+/,b+11,$2);
        print $1,$2
}' yourFile

Test with your new example:

kent$  echo '(bookmarks
("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 23" "#2")
("Exercises 31" "#30")
("Notes and References 42" "#34"))
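'|awk 'BEGIN{FS=OFS="\" \"#"}{if(NF<2){print;next;}
        a=gensub(/.* ([0-9]+)$/,"\\1","g",$1);
        b=gensub(/([0-9]+)\"/,"\\1","g",$2);
        gsub(/[0-9]+$/,a+11,$1);
        gsub(/^[0-9]+/,b+11,$2);
        print $1,$2
}'
(bookmarks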
("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 34" "#13")
("Exercises 42" "#41")
("Notes and References 53" "#45"))

EDIT2, based on @Tim's comment

(1) Does FS=OFS="\" \"#" mean the separator of field in both input and output is double quote, space, double quote and #? Why specify double quote twice?

You are right: it sets the field separator for both input and output. It defines the separator as:

" "#

Both double quotes are included because that makes it easier to isolate the two numbers you want (based on your example input).
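To make the split concrete, here is a small sketch that prints the fields awk sees for one of the sample lines when FS is set to that separator. The function name show_fields is hypothetical, and the square brackets are only there to show where each field starts and ends.

```shell
# Sketch: show how the separator  " "#  splits a line into two fields.
show_fields() {
    awk 'BEGIN { FS = "\" \"#" }
         { printf "NF=%d\n$1=[%s]\n$2=[%s]\n", NF, $1, $2 }'
}

printf '%s\n' '("Exercises 31" "#30")' | show_fields
```

This prints NF=2, then $1=[("Exercises 31] and $2=[30")], so the first number sits at the very end of $1 and the second at the very start of $2. A line without the separator, such as (bookmarks, has NF=1, which is why the command checks NF<2 and prints such lines unchanged.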

(2) In /.* ([0-9]+)$/, does $ mean the end of the string?

Yes. $ anchors the match at the end of the string being searched, which here is the first field.
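As a quick check of what the $ anchor does in that pattern, the sketch below marks the digits that a regex picks in the first field of the first sample line, once with the anchor and once without. The function name mark_number and the angle-bracket markers are made up for illustration; sub() with & is standard awk.

```shell
# Sketch: mark which digits a pattern picks, with and without the $ anchor.
mark_number() {
    awk '{
        anchored = $0; unanchored = $0
        sub(/[0-9]+$/, "<&>", anchored)     # $ ties the match to the end of the string
        sub(/[0-9]+/,  "<&>", unanchored)   # no anchor: the leftmost digits win
        print "with $:    " anchored
        print "without $: " unanchored
    }'
}

printf '%s\n' '("Chapter 1 Introduction 1' | mark_number
```

With the anchor the trailing page number is marked, ("Chapter 1 Introduction <1>; without it the chapter number is hit instead, ("Chapter <1> Introduction 1.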


(3) In the third argument of gensub(), what is the difference between "g" and "G"?

There is no difference between G and g. Check this out:

gensub(regexp, replacement, how [, target])
    Search the target string target for matches of the regular expression regexp. 
    If "how" is a string beginning with ‘g’ or ‘G’ (short for “global”), then 
        replace all matches of regexp with replacement.

This is from the gawk manual, which you can read for detailed usage of gensub.
