OK, a general solution. The following bash function requires 2k arguments; each pair consists of a placeholder and a replacement. It's up to you to quote the strings appropriately to pass them into the function. If the number of arguments is odd, an implicit empty argument will be added, which will effectively delete occurrences of the last placeholder.
Neither placeholders nor replacements may contain NUL characters, but you may use standard C \-escapes such as \0 if you need NULs (and consequently you are required to write \\ if you want a \).
It requires the standard build tools (lex and cc), which should be present on a POSIX-like system.
replaceholder() {
  # Build the scanner in a scratch directory that is removed afterwards.
  local dir=$(mktemp -d)
  ( cd "$dir"
    { printf %s\\n "%option 8bit noyywrap nounput" "%%"
      # One rule per placeholder/replacement pair; double quotes are escaped.
      printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
      printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
    } | lex && cc lex.yy.c
  ) && "$dir"/a.out
  rm -fR "$dir"
}
We assume that \ is already escaped if necessary in the arguments, but we need to escape double quotes, if present. That's what the second argument to the second printf does. Since the lex default action is ECHO, text that matches no placeholder is copied to the output unchanged, so we don't need a rule for it.
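To see what lex is actually fed, the following bash sketch prints the generated specification without compiling it. The helper name print_lex_spec is hypothetical; the three printf calls are copied from the function above.

```shell
# Hypothetical helper (bash): print the lex specification that
# replaceholder would generate, without running lex or cc.
print_lex_spec() {
  printf %s\\n "%option 8bit noyywrap nounput" "%%"
  # printf reuses its format for each pair of arguments; a missing
  # final replacement is treated as an empty string.
  printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
  printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
}

# Two pairs: swap A and B.
print_lex_spec A B B A

# Odd argument count: the rule for C writes an empty string, deleting C.
print_lex_spec A B C
```

Each pair becomes one lex rule, which is why the placeholders and replacements have to be valid inside a double-quoted lex/C string.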
Example run (with timings for the skeptical; it's just a cheap-o commodity laptop):
$ time echo AB | replaceholder A B B A
BA
real 0m0.128s
user 0m0.106s
sys 0m0.042s
$ time printf %s\\n AB{0000..9999} | replaceholder A B B A > /dev/null
real 0m0.118s
user 0m0.117s
sys 0m0.043s
For larger inputs it might be useful to provide an optimization flag to cc, and for current POSIX compatibility, it would be better to use c99. An even more ambitious implementation might try to cache the generated executables instead of generating them each time, but they're not exactly expensive to generate.
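As a rough sketch of the caching idea, the bash variant below keeps each compiled scanner in a cache directory keyed by a checksum of the arguments. The function name, the cache location and the cksum-based key are all assumptions made for illustration; a CRC can collide, so a more careful implementation would want a stronger key.

```shell
# Sketch only (bash): cached_replaceholder, the cache directory and the
# cksum-based key are hypothetical, not part of the original function.
cached_replaceholder() {
  local cache=${XDG_CACHE_HOME:-$HOME/.cache}/replaceholder
  local key exe dir status
  # Key the executable on a checksum of the NUL-delimited argument list.
  key=$(printf '%s\0' "$@" | cksum | tr ' ' _)
  exe=$cache/$key
  if [ ! -x "$exe" ]; then
    dir=$(mktemp -d) || return
    mkdir -p "$cache" &&
    ( cd "$dir" &&
      { printf %s\\n "%option 8bit noyywrap nounput" "%%"
        printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
        printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
      } | lex && cc -O2 -o a.out lex.yy.c
    ) && mv "$dir/a.out" "$exe"
    status=$?
    rm -fR "$dir"
    [ "$status" -eq 0 ] || return "$status"
  fi
  # Run the cached scanner on standard input.
  "$exe"
}
```

The first call with a given argument list pays for lex and cc; later calls with the same arguments only run the cached executable.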
Edit
If you have tcc, you can avoid the hassle of creating a temporary directory, and enjoy the faster compile time, which will help on normal-sized inputs:
treplaceholder () {
  tcc -run <(
    {
      printf %s\\n "%option 8bit noyywrap nounput" "%%"
      printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
      printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
    } | lex -t)
}
$ time printf %s\\n AB{0000..9999} | treplaceholder A B B A > /dev/null
real 0m0.039s
user 0m0.041s
sys 0m0.031s
Edited: added install and demo
You need to take care of at least some edge cases, like
- repeated words at the end (and beginning) of the line.
- search should be case insensitive, because of frequent errors like The the apple.
- probably you want to restrict the search to word constituents, so as not to match something like ( ( a + b) + c ) (repeated opening parentheses).
- only full words should match, to eliminate the thesis.
- when it comes to human language, Unicode characters inside words should be properly interpreted.
All in all I recommend a pcregrep solution:
pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file
Obviously color and line numbers (the n option) are optional, but usually nice to have.
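A quick way to check the pattern against the edge cases listed earlier is a small throwaway file. The sample text below is made up for illustration, and the color option is dropped so the output is plain text.

```shell
# Hypothetical sample file exercising the edge cases listed above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
The the apple fell.
He defended the thesis.
x = ( ( a + b) + c )
This sentence ends with and
and continues on the next line.
EOF

if command -v pcregrep >/dev/null 2>&1; then
  # Same pattern and options as above, minus --color.
  pcregrep -Min '\b([^[:space:]]+)[[:space:]]+\1\b' "$sample"
else
  echo "pcregrep is not installed" >&2
fi
rm -f "$sample"
```

If the pattern behaves as intended, the first line (The the) and the and/and pair spanning lines 4 and 5 should be reported, while the thesis line and the parenthesized expression should not.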
Install
On Debian-based distributions you can install via:
$ sudo apt-get install pcregrep
Example
Run the command on jefferson_typo.txt to see:
$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly
The above is just a text capture; on a terminal with color support, the matches are colorized.
Best Answer
Arithmetic in POSIX shells is done with $ and double parentheses, $(( )). The expansion can be passed to echo to print the result, or assigned to a variable directly (sans echo). There is also expr. In scripting, $(( )) is preferable since it avoids a fork/exec for the expr command.
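A minimal sketch of the three forms mentioned above, using the hypothetical variables num1 and num2:

```shell
num1=7
num2=5

# Print the result of an arithmetic expansion.
echo "$((num1 + num2))"

# Assign from the expansion directly; no echo is needed.
sum=$((num1 + num2))
echo "$sum"

# expr is an external command; each operand and operator is a separate argument.
expr "$num1" + "$num2"
```

All three commands print 12. Only the expr line starts a separate process.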