Bash – Replace multiple strings in a single pass

awkbashreplacesedstring

I'm looking for a way to replace placeholder strings in a template file with concrete values, with common Unix tools (bash, sed, awk, maybe perl). It is important that the replacement is done in a single pass, that is, what is already scanned/replaced must not be considered for another replacement. For example, these two attempts fail:

echo "AB" | awk '{gsub("A","B");gsub("B","A");print}'
>> AA

echo "AB" | sed 's/A/B/g;s/B/A/g'
>> AA

The correct result in this case is of course BA.

In general, the solution should be equivalent to scanning the input left-to-right for a longest match to one of the given replacement strings, and for each match, performing a replacement and continuing from that point on in the input (none of the already read input nor the replacements performed should be considered for matches). Actually, the details don't matter, just that the results of the replacement are never considered for another replacement, in whole or in part.

NOTE I am only looking for correct generic solutions. Please do not propose solutions which fail for certain inputs (input files, search and replace pairs), however unlikely they may seem.

Best Answer

OK, a general solution. The following bash function requires 2k arguments; each pair consists of a placeholder and a replacement. It's up to you to quote the strings appropriately to pass them into the function. If the number of arguments is odd, an implicit empty argument will be added, which will effectively delete occurrences of the last placeholder.

Neither placeholders nor replacements may contain NUL characters, but you may use standard C \-escapes such as \0 if you need NULs (and consequently you are required to write \\ if you want a \).

It requires the standard build tools which should be present on a posix-like system (lex and cc).

replaceholder() {
  local dir=$(mktemp -d)
  ( cd "$dir"
    { printf %s\\n "%option 8bit noyywrap nounput" "%%"
      printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
      printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
    } | lex && cc lex.yy.c
  ) && "$dir"/a.out
  rm -fR "$dir"
}

We assume that \ is already escaped if necessary in the arguments but we need to escape double quotes, if present. That's what the second argument to the second printf does. Since the lex default action is ECHO, we don't need to worry about it.

Example run (with timings for the skeptical; it's just a cheap-o commodity laptop):

$ time echo AB | replaceholder A B B A
BA

real    0m0.128s
user    0m0.106s
sys     0m0.042s
$ time printf %s\\n AB{0000..9999} | replaceholder A B B A > /dev/null

real    0m0.118s
user    0m0.117s
sys     0m0.043s

For larger inputs it might be useful to provide an optimization flag to cc, and for current Posix compatibility, it would be better to use c99. An even more ambitious implementation might try to cache the generated executables instead of generating them each time, but they're not exactly expensive to generate.

Edit

If you have tcc, you can avoid the hassle of creating a temporary directory, and enjoy the faster compile time which will help on normal sized inputs:

treplaceholder () { 
  tcc -run <(
  {
    printf %s\\n "%option 8bit noyywrap nounput" "%%"
    printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
    printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
  } | lex -t)
}

$ time printf %s\\n AB{0000..9999} | treplaceholder A B B A > /dev/null

real    0m0.039s
user    0m0.041s
sys     0m0.031s
Related Question