Text Processing Sed Awk Gawk – String Replacement Using a Dictionary

Tags: awk, gawk, sed, text processing

What is a good way to do string replacements in a file using a dictionary with a lot of substituend-substituent pairs? And by a lot, I actually mean about 20: not many, but enough that I want to organize them neatly.

I would like to collect all substituend-substituent pairs in a file dictionary.txt in an easy-to-manage way, since I need to replace a lot of stuff, for example:

"yes"      : "no"
"stop"     : "go, go, go!"
"wee-ooo"  : "ooooh nooo!"
"goodbye"  : "hello"

"high"     : "low"
"why?"     : "i don't know"

Now I want to apply these substitutions in some file novel.txt.

Then I want to run magiccommand --magicflags dictionary.txt novel.txt so that all instances of yes in novel.txt are replaced by no (so even Bayesian would be replaced by Banoian) and all instances of goodbye in novel.txt would be replaced by hello and so forth.

So far, the strings I need to replace (and replace with) do not have any quotes (neither single nor double) in them. (It would be nice, though, to see a solution working well with strings containing quotes, of course.)

I know sed and awk / gawk can do such things in principle, but can they also work with such dictionary files? gawk seems like the right candidate for magiccommand; what are the right magicflags? How do I need to format my dictionary.txt?

Best Answer

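Here's one way with sed (the leading /^[[:blank:]]*$/d skips blank lines, such as the one in your sample dictionary, which would otherwise be turned into an empty s///g command):

sed '
/^[[:blank:]]*$/d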
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
h
s|.*\n||
s|[\&/]|\\&|g
x
s|\n.*||
s|[[\.*^$/]|\\&|g
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
' dictionary.txt | sed -f - novel.txt

How it works:
The 1st sed turns dictionary.txt into a script file (editing commands, one per line). This is piped to the 2nd sed (note the -f -, which means read commands from stdin; if your sed does not accept that, -f /dev/stdin may work instead), which executes those commands, editing novel.txt.
Building the script requires translating your format

"STRING"   :   "REPLACEMENT"

into a sed command and escaping any special characters in the process for both LHS and RHS:

s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g

So the first substitution

s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|

turns "STRING" : "REPLACEMENT" into STRING\nREPLACEMENT (\n is a newline char). The result is then copied to the hold space with h.
s|.*\n|| deletes the first part, keeping only REPLACEMENT; then s|[\&/]|\\&|g escapes the characters that are special on the right-hand side of s: the backslash, & and the / delimiter (this is the RHS).
x then exchanges the hold space with the pattern space, s|\n.*|| deletes the second part, keeping only STRING, and s|[[\.*^$/]|\\&|g escapes the characters that are special in a basic regular expression, plus the / delimiter (this is the LHS).
The content of the hold space is then appended to the pattern space via G, so the pattern space now holds ESCAPED_STRING\nESCAPED_REPLACEMENT.
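The h / x / G sequence described above can be tried in isolation. The following is only a toy sketch (the input abc:abc and the y command are made up for illustration): it keeps the left half as is, uppercases the right half separately, and joins them again.

```shell
# Toy input "abc:abc": process the part after ":" and the part before ":"
# independently, using the hold space as temporary storage.
swapped=$(printf '%s\n' 'abc:abc' | sed '
h
s/.*://
y/abc/ABC/
x
s/:.*//
G
s/\n/:/
')
printf '%s\n' "$swapped"   # prints abc:ABC
```

h saves the whole line, the next two commands work on the right half only, x brings the untouched line back while parking the processed half, and G glues the parked half back on after a newline.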
The final substitution

s|\(.*\)\n\(.*\)|s/\1/\2/g|

transforms it into s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g
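To see what the generated script looks like, the first stage can be run on its own. This is a minimal sketch under a few assumptions: dict2sed is a hypothetical helper name, the sample files are made up, the generated rules go through a temporary file instead of -f -, and a guard line that drops blank dictionary lines is included so that an empty line cannot yield an empty s///g rule.

```shell
#!/bin/sh
# dict2sed (hypothetical name) wraps the first sed stage from the answer,
# with a guard that deletes blank dictionary lines.
dict2sed() {
    sed '
/^[[:blank:]]*$/d
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
h
s|.*\n||
s|[\&/]|\\&|g
x
s|\n.*||
s|[[\.*^$/]|\\&|g
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
' "$1"
}

workdir=$(mktemp -d)
cat > "$workdir/dictionary.txt" <<'EOF'
"yes"   : "no"

"a.b"   : "x/y & z"
"why?"  : "i don't know"
EOF
printf '%s\n' 'yes, a.b and aXb. why?' > "$workdir/novel.txt"

# Stage 1: dictionary -> sed rules, one per line.
rules=$(dict2sed "$workdir/dictionary.txt")
printf '%s\n' "$rules"
# s/yes/no/g
# s/a\.b/x\/y \& z/g
# s/why?/i don't know/g

# Stage 2: apply the rules to the text.
printf '%s\n' "$rules" > "$workdir/rules.sed"
result=$(sed -f "$workdir/rules.sed" "$workdir/novel.txt")
printf '%s\n' "$result"
# no, x/y & z and aXb. i don't know

rm -rf "$workdir"
```

The escaped dot keeps a.b from matching aXb, and the escaped & and / are inserted literally. Note that the rules run in dictionary order on each line, so a later rule also sees text produced by an earlier one; a dictionary that maps yes to no and no to yes would not swap the two words.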

Related Solutions

AWK – How to Use Regex with AWK for String Replacement

Try this (it needs gawk, since gensub() is a gawk extension):

awk '{a=gensub(/.*#([0-9]+)(\").*/,"\\1","g",$0);if(a~/[0-9]+/) {gsub(/[0-9]+\"/,a+11"\"",$0);}print $0}' YourFile

Test with your example:

kent$  echo '(bookmarks
("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 2" "#2")
("Exercises 30" "#30")
("Notes and References 34" "#34"))
)
'|awk '{a=gensub(/.*#([0-9]+)(\").*/,"\\1","g",$0);if(a~/[0-9]+/) {gsub(/[0-9]+\"/,a+11"\"",$0);}print $0}'   
(bookmarks
("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 13" "#13")
("Exercises 41" "#41")
("Notes and References 45" "#45"))
)

Note that this command won't work if the two numbers (e.g. 1" and "#1") are different, or if one line contains more numbers matching this pattern (e.g. 23" ...32"..."#123").

UPDATE

Since @Tim (OP) said the numbers followed by " on the same line could differ, I made some changes to my previous solution so that it works for your new example.

BTW, from the example I think it could be a table-of-contents structure, so I don't see how the two numbers could be different. The first would be the printed page number, and the 2nd, with #, would be the page index. Am I right?

Anyway, you know your requirements best. Here is the new solution, still with gawk (I broke the command into lines to make it easier to read):

awk 'BEGIN{FS=OFS="\" \"#"}{if(NF<2){print;next;}
        a=gensub(/.* ([0-9]+)$/,"\\1","g",$1);
        b=gensub(/([0-9]+)\"/,"\\1","g",$2); 
        gsub(/[0-9]+$/,a+11,$1);
        gsub(/^[0-9]+/,b+11,$2);
        print $1,$2
}' yourFile

Test with your new example:

kent$  echo '(bookmarks
("Chapter 1 Introduction 1" "#1"
("1.1 Problem Statement and Basic Definitions 23" "#2")
("Exercises 31" "#30")
("Notes and References 42" "#34"))
)
'|awk 'BEGIN{FS=OFS="\" \"#"}{if(NF<2){print;next;}
        a=gensub(/.* ([0-9]+)$/,"\\1","g",$1);
        b=gensub(/([0-9]+)\"/,"\\1","g",$2); 
        gsub(/[0-9]+$/,a+11,$1);
        gsub(/^[0-9]+/,b+11,$2);
        print $1,$2
}'                        
(bookmarks
("Chapter 1 Introduction 12" "#12"
("1.1 Problem Statement and Basic Definitions 34" "#13")
("Exercises 42" "#41")
("Notes and References 53" "#45"))
)

EDIT 2, based on @Tim's comments

(1) Does FS=OFS="\" \"#" mean the separator of field in both input and output is double quote, space, double quote and #? Why specify double quote twice?

You are right: it is the field separator for both input and output. It defines the separator as:

" "#

Both double quotes are included because that makes it easier to isolate the two numbers you want (based on your example input).
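As a quick check of how that separator splits a line, here is a small sketch using two lines taken from the example input (the variable names are only for illustration):

```shell
# A bookmark line contains the separator once, so it splits into two fields.
fields=$(printf '%s\n' '("Exercises 31" "#30")' |
    awk 'BEGIN { FS = "\" \"#" } { print NF; print $1; print $2 }')
printf '%s\n' "$fields"
# 2
# ("Exercises 31
# 30")

# A line without the separator stays a single field, hence the NF<2 test.
nf=$(printf '%s\n' '(bookmarks' | awk 'BEGIN { FS = "\" \"#" } { print NF }')
printf '%s\n' "$nf"
# 1
```

The first number ends $1 and the second number starts $2, which is why the two gsub() calls can anchor on $ and ^ respectively.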

(2) In /.* ([0-9]+)$/, does $ mean the end of the string?

Exactly!

(3) In the third argument of gensub(), what is the difference between "g" and "G"?

There is no difference between G and g. Check this out:

gensub(regexp, replacement, how [, target])
    Search the target string target for matches of the regular expression regexp. 
    If "how" is a string beginning with ‘g’ or ‘G’ (short for “global”), then 
        replace all matches of regexp with replacement.

This is from http://www.gnu.org/s/gawk/manual/html_node/String-Functions.html, where you can read the detailed usage of gensub.
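A small sketch to confirm this, assuming gawk is installed (the string banana is just a made-up test value). It also shows two points that go beyond the excerpt quoted above but are part of gensub's documented behaviour: a numeric how replaces only that occurrence, and the target itself is not modified.

```shell
out=$(gawk 'BEGIN {
    s = "banana"
    print gensub(/a/, "X", "g", s)   # all matches
    print gensub(/a/, "X", "G", s)   # same result as "g"
    print gensub(/a/, "X", 2, s)     # only the 2nd match
    print s                          # the target string is left unchanged
}')
printf '%s\n' "$out"
# bXnXnX
# bXnXnX
# banXna
# banana
```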

Case matching pattern replacement with sed

Portable solution using sed:

sed '
:1
/[aA][bB][cC][dD][eE][fF]/!b
s//\
&\
pqrstu\
PQRSTU\
/;:2
s/\n[[:lower:]]\(.*\n\)\(.\)\(.*\n\).\(.*\n\)/\2\
\1\3\4/;s/\n[^[:lower:]]\(.*\n\).\(.*\n\)\(.\)\(.*\n\)/\3\
\1\2\4/;t2
s/\n.*\n//;b1'

It's a bit easier with GNU sed:

search=abcdef replace=pqrstuvwx
sed -r ":1;/$search/I!b;s//\n&&&\n$replace\n/;:2
    s/\n[[:lower:]](.*\n)(.)(.*\n)/\l\2\n\1\3/
    s/\n[^[:lower:]](.*\n)(.)(.*\n)/\u\2\n\1\3/;t2
    s/\n.*\n(.*)\n/\1/g;b1"

By using &&& above, we reuse the case pattern of the string for the rest of the replacement, so ABcdef would be changed to PQrstuVWx and AbCdEf to PqRsTuVwX. Change it to & to affect only the case of the first 6 characters.

Note that it may not do what you want, or may run into an infinite loop, if the replacement may itself be subject to substitution (for instance when substituting foo for foo, or bcd for abcd).
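The GNU sed variant can be checked against the two examples mentioned above. This sketch assumes GNU sed (it relies on -r, the I flag and the \l / \u escapes) and a non-interactive shell, so that !b is not subject to history expansion; the variable recased is only for illustration.

```shell
search=abcdef replace=pqrstuvwx
# Feed the two sample words, one per line, through the GNU sed command.
recased=$(printf '%s\n' ABcdef AbCdEf | sed -r ":1;/$search/I!b;s//\n&&&\n$replace\n/;:2
    s/\n[[:lower:]](.*\n)(.)(.*\n)/\l\2\n\1\3/
    s/\n[^[:lower:]](.*\n)(.)(.*\n)/\u\2\n\1\3/;t2
    s/\n.*\n(.*)\n/\1/g;b1")
printf '%s\n' "$recased"
# PQrstuVWx
# PqRsTuVwX
```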
