Sed – Modify Every Non-First Word Repetition in Text

Tags: bash, regular expression, sed, text processing, word processing

Using sed, I need to mark every occurrence of a word except its first one. For example, this input:

qq    ab xyz     ab qq aa ab

Becomes:

qq    ab xyz     +ab+ +qq+ aa +ab+

Best Answer

If your input contains no <, > or + characters, you could do:

sed '
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g'
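
To see why this works, it can help to run the tagging step on its own and then the full script. The following is only a sketch for experimenting; the function names tag_words and mark_repeats are hypothetical and not part of the answer.

```shell
# Step 1 only: wrap every alphanumeric run in <...>.
tag_words() {
  sed 's/[[:alnum:]]\{1,\}/<&>/g'
}

# Full script: tag the words, then loop. Each pass finds a tagged word
# that has an identical tagged word later on the line and rewrites that
# later copy as +word+. The first copy keeps its tags, so it stays
# available as the anchor for any remaining copies.
mark_repeats() {
  sed '
    s/[[:alnum:]]\{1,\}/<&>/g;:1
    s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
    s/[<>]//g'
}

printf '%s\n' 'qq ab xyz ab qq aa ab' | tag_words
# prints: <qq> <ab> <xyz> <ab> <qq> <aa> <ab>
printf '%s\n' 'qq ab xyz ab qq aa ab' | mark_repeats
# prints: qq ab xyz +ab+ +qq+ aa +ab+
```

The loop ends once no tagged word has a tagged duplicate to its right, because a copy rewritten as +word+ no longer matches <\2>.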

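If it may contain < or >, escape them first and restore them afterwards (a literal + in the input is left alone, so it can still be confused with the markers):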

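sed '
  # Escape so that no literal < or > is left: : becomes :-, < becomes :{, > becomes :}
  s/:/:-/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  # Undo the escaping; coding : as :- (not ::) keeps an input such as :} intact
  s/:}/>/g;s/:{/</g;s/:-/:/g'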
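
As a quick check of the escaping idea, the sketch below feeds the script a line that already contains <, > and :}. The function name mark_repeats_escaped is hypothetical, and this variant codes a literal colon as :- (an assumption of this sketch) so that every escape sequence can be undone unambiguously.

```shell
# Hypothetical wrapper: escape : < >, mark repeats, then undo the escaping.
mark_repeats_escaped() {
  sed '
    s/:/:-/g;s/</:{/g;s/>/:}/g
    s/[[:alnum:]]\{1,\}/<&>/g;:1
    s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
    s/[<>]//g
    s/:}/>/g;s/:{/</g;s/:-/:/g'
}

printf '%s\n' 'a<b> a :} a' | mark_repeats_escaped
# prints: a<b> +a+ :} +a+
```

The escape codes consist of punctuation only, so they never merge with a neighbouring word when the alphanumeric runs are tagged.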

Both scripts process each line independently. To treat the whole file as one unit instead, you first need to load it into the pattern space (note that some sed implementations limit its size):

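sed '
  # Append every remaining line to the pattern space
  :2
  $!{N;b2
  }
  # Escape so that no literal < or > is left: : becomes :-, < becomes :{, > becomes :}
  s/:/:-/g;s/</:{/g;s/>/:}/g
  s/[[:alnum:]]\{1,\}/<&>/g;:1
  s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
  s/[<>]//g
  # Undo the escaping; coding : as :- (not ::) keeps an input such as :} intact
  s/:}/>/g;s/:{/</g;s/:-/:/g'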
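
The effect of the :2 loop is easiest to see on a two-line input. This is a minimal sketch that, to stay short, leaves out the escaping and therefore assumes the input contains no <, > or + characters; the name mark_repeats_whole is hypothetical.

```shell
# Hypothetical helper: treat all of standard input as one unit.
# Assumes the input contains no <, > or + characters.
mark_repeats_whole() {
  sed '
    :2
    $!{N;b2
    }
    s/[[:alnum:]]\{1,\}/<&>/g;:1
    s/\(<\([^>]*\)>.*\)<\2>/\1+\2+/;t1
    s/[<>]//g'
}

printf 'a b\nb a\n' | mark_repeats_whole
# prints:
# a b
# +b+ +a+
```

Words on the second line are marked because their first occurrence is on the first line; the per-line scripts would leave both lines unchanged.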

That's going to be pretty inefficient, though, and the job is a lot easier with perl. For the whole input:

perl -pe 's/\w+/$seen{$&}++ ? "+$&+" : $&/ge'

Line-based:

perl -pe 'my %seen;s/\w+/$seen{$&}++ ? "+$&+" : $&/ge'

Related Solutions

sed -e 's/^[0-9]//' does not work for the first line

Your file starts with a UTF-8 byte order mark. It is unicode symbol U+FEFF which is encoded as three bytes in UTF-8. Those three bytes show up as 357 273 277 when you print them in base 8.

To sed, those bytes at the start of the line mean that 1 is in fact not the first character on that line. Many other tools will treat it the same way.

You need to remove the BOM before doing other processing in order to get a useful result. For instance, you could start your sed script with s/^\xef\xbb\xbf// to remove the BOM (the \x escapes are supported by GNU sed but are not specified by POSIX). Your full command would then become:

sed -e 's/^\xef\xbb\xbf//;s/^[0-9]//'
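
Where the \x escapes are not available, one possible workaround is to let printf produce the three BOM bytes and substitute them into the script. This is a sketch under that assumption; the function name strip_bom_and_digit is hypothetical, and the 1 address is a choice made here because a BOM only appears at the very start of a file.

```shell
# Hypothetical helper: drop a leading UTF-8 BOM from line 1, then drop
# one leading digit from every line.
strip_bom_and_digit() {
  # Octal 357 273 277 are the bytes EF BB BF.
  bom=$(printf '\357\273\277')
  # LC_ALL=C makes sed treat the BOM as three plain bytes.
  LC_ALL=C sed -e "1s/^$bom//" -e 's/^[0-9]//'
}

printf '\357\273\2771abc\n2def\n' | strip_bom_and_digit
# prints:
# abc
# def
```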

Return word before a matched word using sed

If you only need to handle that one line, you could use the sed command:

sed -e 's/.* \([[:digit:]]\{1,\}\) processes running\./\1/'

For a slightly more robust approach, the following script accepts arbitrary input and prints only the lines where the pattern matched:

sed -ne 's/.* \([[:digit:]]\{1,\}\) processes running\./\1/p'
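
The two commands differ only in what happens to lines that do not match. The sample input and the function names extract_plain and extract_matching_only below are made up for illustration.

```shell
# Without -n: lines that do not match pass through unchanged.
extract_plain() {
  sed -e 's/.* \([[:digit:]]\{1,\}\) processes running\./\1/'
}

# With -n and the p flag: only lines where the substitution
# succeeded are printed.
extract_matching_only() {
  sed -ne 's/.* \([[:digit:]]\{1,\}\) processes running\./\1/p'
}

printf '%s\n' 'There are 42 processes running.' 'unrelated line' | extract_plain
# prints:
# 42
# unrelated line
printf '%s\n' 'There are 42 processes running.' 'unrelated line' | extract_matching_only
# prints:
# 42
```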
