Text Processing Sed Awk Gawk – String Replacement Using a Dictionary

Tags: awk, gawk, sed, text processing

What is a good way to do string replacements in a file using a dictionary with a lot of substituend-substituent pairs? By "a lot" I actually mean about 20: not many, but enough that I want to organize them neatly.

I kind of want to collect all substituend-substituent pairs in a file dictionary.txt in an easy-to-manage way, since I need to replace a lot of stuff, say like:

"yes"      : "no"
"stop"     : "go, go, go!"
"wee-ooo"  : "ooooh nooo!"
"gooodbye" : "hello"

"high"     : "low"
"why?"     : "i don't know"

Now I want to apply these substitutions in some file novel.txt.

Then I want to run magiccommand --magicflags dictionary.txt novel.txt so that all instances of yes in novel.txt are replaced by no (so even Bayesian would become Banoian), all instances of gooodbye are replaced by hello, and so forth.

So far, the strings I need to replace (and replace with) contain no quotes, neither single nor double. (It would be nice, though, to see a solution that also works with strings containing quotes.)

I know sed and awk / gawk can do such things in principle, but can they also work with such dictionary files? It seems like gawk would be the right candidate for magiccommand; if so, what are the right magicflags? How do I need to format my dictionary.txt?

Best Answer

Here's one way with sed:

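sed '
# Skip blank lines so they do not become an empty s///g command.
/^[[:blank:]]*$/d
# Split "STRING" : "REPLACEMENT" into STRING<newline>REPLACEMENT.
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
# Save a copy, keep only REPLACEMENT and escape it for the RHS.
h
s|.*\n||
s|[\&/]|\\&|g
# Swap buffers, keep only STRING and escape it for the LHS.
x
s|\n.*||
s|[[\.*^$/]|\\&|g
# Rejoin both parts and emit the final s command.
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
' dictionary.txt | sed -f - novel.txt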

How it works:
The first sed turns dictionary.txt into a script file (editing commands, one per line). This is piped to the second sed (note the -f -, which means "read commands from stdin"), which executes those commands, editing novel.txt.
This requires translating your format

"STRING"   :   "REPLACEMENT"

into a sed command and escaping any special characters in the process for both LHS and RHS:

s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g

So the first substitution

s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|

turns "STRING" : "REPLACEMENT" into STRING\nREPLACEMENT (\n is a newline character). The h command then copies the result to the hold space.
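To check this step in isolation, here is a small sketch that feeds a single dictionary line through the first substitution only. The variable name split is just an illustrative choice, not part of the answer.

```shell
# Run only the first substitution on one dictionary line.
# "split" is a hypothetical variable name used for this demonstration.
split=$(printf '%s\n' '"stop"     : "go, go, go!"' |
sed 's|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|')

# The pattern space now holds STRING and REPLACEMENT on separate lines.
printf '%s\n' "$split"
```

This should print stop on the first line and go, go, go! on the second, with the quotes, the colon and the padding removed.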
s|.*\n|| deletes the first part, keeping only REPLACEMENT; then s|[\&/]|\\&|g escapes the reserved characters (backslash, & and /), since this is the RHS.
x then exchanges the hold space with the pattern space; s|\n.*|| deletes the second part, keeping only STRING, and s|[[\.*^$/]|\\&|g does the escaping (this is the LHS).
The content of the hold space is then appended to the pattern space via G, so the pattern space now contains ESCAPED_STRING\nESCAPED_REPLACEMENT.
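The two escaping passes described above can be tried on their own. The sample strings and the variable names lhs and rhs below are assumptions made up for illustration; only the two sed expressions come from the answer.

```shell
# LHS escaping: characters that are special in a basic regular expression
# (plus the / delimiter) each get a backslash in front.
lhs=$(printf '%s\n' 'a.b*c/d' | sed 's|[[\.*^$/]|\\&|g')

# RHS escaping: only backslash, & and the / delimiter are special
# in the replacement part of an s command.
rhs=$(printf '%s\n' 'x/y&z\w' | sed 's|[\&/]|\\&|g')

printf '%s\n' "$lhs" "$rhs"
```

This should print a\.b\*c\/d followed by x\/y\&z\\w. The two character sets differ because the pattern side is read as a regular expression while the replacement side is not.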
The final substitution

s|\(.*\)\n\(.*\)|s/\1/\2/g|

transforms it into s/ESCAPED_STRING/ESCAPED_REPLACEMENT/g
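For an end-to-end check, the sketch below writes the generated script to a temporary file so it can be inspected before it is applied. The helper name make_sed_script, the temporary directory and the sample inputs are all assumptions for illustration. It also deletes blank dictionary lines first, an addition assumed here so that an empty line cannot turn into an empty s///g command.

```shell
#!/bin/sh
# make_sed_script is a hypothetical helper: it prints one s/LHS/RHS/g
# command per non-blank line of the dictionary file given as $1.
make_sed_script() {
    sed '
/^[[:blank:]]*$/d
s|"\(.*\)"[[:blank:]]*:[[:blank:]]*"\(.*\)"|\1\
\2|
h
s|.*\n||
s|[\&/]|\\&|g
x
s|\n.*||
s|[[\.*^$/]|\\&|g
G
s|\(.*\)\n\(.*\)|s/\1/\2/g|
' "$1"
}

# Sample inputs (made up for this demonstration).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cat > "$workdir/dictionary.txt" <<'EOF'
"yes"      : "no"
"why?"     : "i don't know"

"a.b"      : "x/y&z"
EOF
printf '%s\n' 'yes, Bayesian. why? a.b but not aXb' > "$workdir/novel.txt"

# Stage 1: build and show the script. Stage 2: apply it.
make_sed_script "$workdir/dictionary.txt" > "$workdir/script.sed"
cat "$workdir/script.sed"
sed -f "$workdir/script.sed" "$workdir/novel.txt"
```

With these inputs the generated script should be

s/yes/no/g
s/why?/i don't know/g
s/a\.b/x\/y\&z/g

and the edited text should read: no, Banoian. i don't know x/y&z but not aXb. The commands run in dictionary order on every line, so a later pair can also match text produced by an earlier one.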
