Sed – Modify Every Non-First Word Repetition in Text

Tags: bash, regular expression, sed, text processing, word processing

I need to mark every occurrence of a word after its first one, using sed. For example, this input:

qq    ab xyz     ab qq aa ab 

Becomes:

qq    ab xyz     +ab+ +qq+ aa +ab+

Best Answer

If your input contains no <, > or + characters, you could do:

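sed '
  # wrap every word in <...>
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  # rewrite a <word> that repeats an earlier <word> as +word+, looping until none is left
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  # unwrap the remaining first occurrences
  s/[<>]//g'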
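As a quick check, here is a minimal sketch that runs the command above on the sample input from the question. The function name mark_repeats is only an illustrative wrapper and is not part of the answer:

```shell
# mark_repeats is a hypothetical helper name used only for this example.
mark_repeats() {
  sed '
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g'
}

printf '%s\n' 'qq    ab xyz     ab qq aa ab' | mark_repeats
# prints: qq    ab xyz     +ab+ +qq+ aa +ab+
```

The first substitution wraps every word in <...>, the loop rewrites any <word> that has an identical <word> earlier on the line as +word+, and the last substitution strips the markers left on first occurrences.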

If it may contain them, you can escape them first and restore them afterwards:

sed '
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'
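To see the escaping at work, the sketch below feeds the same command a line that already contains < and >. The wrapper name mark_repeats_escaped and the sample line are made up for illustration:

```shell
# mark_repeats_escaped is a hypothetical helper name used only for this example.
mark_repeats_escaped() {
  sed '
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'
}

printf '%s\n' 'a <b> a <b>' | mark_repeats_escaped
# prints: a <b> +a+ <+b+>
```

The literal < and > are hidden as :{ and :} while the temporary <word> markers are in use, then restored by the last line.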

Both of those work on each line independently. To do it across the whole file, you'd need to load the whole file into the pattern space first (note that some sed implementations limit its size):

sed '
  :2
  $!{N;b2
  }
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'
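A small sketch of the difference, assuming a two-line input where each word repeats only on the other line (the wrapper name mark_repeats_file and the input are illustrative). The per-line commands would leave this input unchanged, while the whole-file version marks the second line:

```shell
# mark_repeats_file is a hypothetical helper name used only for this example.
mark_repeats_file() {
  sed '
  :2
  $!{N;b2
  }
  s/:/::/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  s/:}/>/g;s/:{/</g;s/::/:/g'
}

printf '%s\n' 'foo bar' 'bar foo' | mark_repeats_file
# prints:
# foo bar
# +bar+ +foo+
```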

That's going to be pretty inefficient, though, and it would be a lot easier with perl. Across the whole input:

perl -pe 's/\w+/$seen{$&}++ ? "+$&+" : $&/ge'

Or independently on each line:

perl -pe 'my %seen;s/\w+/$seen{$&}++ ? "+$&+" : $&/ge'