I'm looking for a way to replace placeholder strings in a template file with concrete values, with common Unix tools (bash, sed, awk, maybe perl). It is important that the replacement is done in a single pass, that is, what is already scanned/replaced must not be considered for another replacement. For example, these two attempts fail:
echo "AB" | awk '{gsub("A","B");gsub("B","A");print}'
>> AA
echo "AB" | sed 's/A/B/g;s/B/A/g'
>> AA
The correct result in this case is of course BA.
In general, the solution should be equivalent to scanning the input left-to-right for a longest match to one of the given replacement strings, and for each match, performing a replacement and continuing from that point on in the input (none of the already read input nor the replacements performed should be considered for matches). Actually, the details don't matter, just that the results of the replacement are never considered for another replacement, in whole or in part.
NOTE I am only looking for correct generic solutions. Please do not propose solutions which fail for certain inputs (input files, search and replace pairs), however unlikely they may seem.
Best Answer
OK, a general solution. The following bash function requires 2k arguments; each pair consists of a placeholder and a replacement. It's up to you to quote the strings appropriately to pass them into the function. If the number of arguments is odd, an implicit empty argument will be added, which will effectively delete occurrences of the last placeholder.
Neither placeholders nor replacements may contain NUL characters, but you may use standard C \-escapes such as \0 if you need NULs (and consequently you are required to write \\ if you want a \).
It requires the standard build tools which should be present on a posix-like system (lex and cc).
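A minimal sketch of what such a function could look like, under some assumptions: the name substitute is hypothetical, the %option line uses flex syntax (so lex is assumed to be flex), and the details are an illustrative reconstruction of the approach described here rather than the answer's exact code. It generates one lex rule per placeholder/replacement pair, compiles the scanner in a temporary directory, and runs it on standard input.

```shell
#!/bin/bash
# Hypothetical sketch: substitute FROM1 TO1 [FROM2 TO2 ...]
# Reads stdin, writes stdout. Assumes lex (flex) and cc are installed.
substitute() {
  local tmp status
  # With no pairs there is nothing to replace; just copy the input.
  [ "$#" -gt 0 ] || { cat; return; }
  tmp=$(mktemp -d) || return 1
  {
    # Scanner options (flex syntax): 8-bit clean, no yywrap/unput/input.
    printf '%s\n' '%option 8bit noyywrap nounput noinput' '%%'
    # One rule per pair: printf reuses the format for every two arguments,
    # and a missing last argument is treated as an empty string.
    # The expansion turns each " into \" so it survives inside "...".
    # Note: fputs stops at a NUL byte, so this sketch cannot emit \0
    # from a replacement.
    printf '"%s" { fputs("%s", yyout); }\n' "${@//\"/\\\"}"
    printf '%s\n' '%%' 'int main(void) { return yylex(); }'
  } | lex -t > "$tmp/subst.c" &&
    cc -o "$tmp/subst" "$tmp/subst.c" &&
    "$tmp/subst"
  status=$?
  rm -rf "$tmp"
  return "$status"
}
```

With this sketch, printf 'AB\n' | substitute A B B A prints BA, because the generated scanner consumes each matched placeholder once and never rescans what it has written.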
We assume that \ is already escaped if necessary in the arguments, but we need to escape double quotes, if present. That's what the second argument to the second printf does. Since the lex default action is ECHO, text that matches no placeholder is copied to the output unchanged, so we don't need a rule for it.
Example run (with timings for the skeptical; it's just a cheap-o commodity laptop):
For larger inputs it might be useful to provide an optimization flag to cc, and for current Posix compatibility, it would be better to use c99. An even more ambitious implementation might try to cache the generated executables instead of generating them each time, but they're not exactly expensive to generate.
Edit
If you have tcc, you can avoid the hassle of creating a temporary directory, and enjoy the faster compile time which will help on normal sized inputs: