Text Processing – Replace First k Instances of a Word in a File with sed

Tags: awk, sed, text processing

I want to replace only the first k instances of a word.

How can I do this?

E.g., say the file foo.txt contains 100 occurrences of the word 'linux'.

I need to replace only the first 50 occurrences.

Best Answer

The first section below describes using sed to change the first k occurrences on a line. The second section extends this approach to change only the first k occurrences in a file, regardless of which line they appear on.

Line-oriented solution

With standard sed, there is a flag to replace only the k-th occurrence of a word on a line. If k is 3, for example:

sed 's/old/new/3'

Or, one can replace all occurrences with:

sed 's/old/new/g'

Neither of these is what you want.

GNU sed offers an extension that changes the k-th occurrence and all occurrences after it. If k is 3, for example:

sed 's/old/new/g3'
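To make the difference between the three flag forms concrete, here is a small sketch that runs each of them against the same input line. The helper name nth_flag_demo is made up for illustration, and the g3 form assumes GNU sed.

```shell
# Hypothetical helper: apply one s-command flag to a fixed sample line.
nth_flag_demo() {
  # $1 is the flag appended to the s command: 3, g, or g3 (g3 assumes GNU sed)
  echo 'old old old old old' | sed "s/old/new/$1"
}

nth_flag_demo 3    # old old new old old
nth_flag_demo g    # new new new new new
nth_flag_demo g3   # old old new new new
```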

These can be combined to do what you want. To change the first 3 occurrences:

$ echo old old old old old | sed -E 's/\<old\>/\n/g4; s/\<old\>/new/g; s/\n/old/g'
new new new old old

Here \n is a safe placeholder because sed reads its input one line at a time, so the pattern space never contains a newline to begin with.

Explanation:

We use three sed substitution commands:

s/\<old\>/\n/g4

This is the GNU extension that replaces the fourth and all subsequent occurrences of old with \n.

\< matches the beginning of a word and \> matches the end of a word, which ensures that only complete words are matched. Both are GNU regex extensions rather than POSIX syntax; GNU sed accepts them with or without the -E option, which selects extended regex syntax.

s/\<old\>/new/g

Only the first three occurrences of old remain, and this replaces them all with new.

s/\n/old/g

The fourth and all remaining occurrences of old were replaced with \n in the first step. This restores them to their original state.
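The three steps above can be wrapped so that k becomes a parameter. This is only a sketch: the function name replace_first_k_per_line is hypothetical, it assumes GNU sed, and it assumes the old and new words contain no regex metacharacters or slashes.

```shell
# Hypothetical helper: replace the first K whole-word matches on each line.
# Assumes GNU sed, and that OLD and NEW contain no regex metacharacters or slashes.
replace_first_k_per_line() {
  k=$1 old=$2 new=$3
  # 1) park occurrences K+1 and later as newlines, 2) replace what is left,
  # 3) turn the parked newlines back into OLD
  sed -E "s/\<$old\>/\n/g$((k + 1)); s/\<$old\>/$new/g; s/\n/$old/g"
}

echo 'old old old old old' | replace_first_k_per_line 3 old new
# new new new old old
```

The count is applied to every line separately, so a two-line input with five matches on each line would have three replaced on each.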

Non-GNU solution

If GNU sed is not available and you want to change the first 3 occurrences of old to new, then use three s commands:

$ echo old old old old old | sed -E -e 's/\<old\>/new/' -e 's/\<old\>/new/' -e 's/\<old\>/new/'
new new new old old

This works well when k is a small number but scales poorly to large k.

Since some non-GNU seds do not support combining commands with semicolons, each command here is introduced with its own -e option. It may also be necessary to verify that your sed supports the word boundary symbols, \< and \>.
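For larger k, the repeated -e expressions can be generated in a loop rather than typed out. The sketch below uses a hypothetical function name, replace_first_k_portable, and assumes that the old and new words contain no regex metacharacters or slashes, that the new word does not itself contain the old word as a whole word, and that your sed understands \< and \>.

```shell
# Hypothetical helper: build K identical -e expressions instead of typing them.
# Assumes OLD and NEW contain no regex metacharacters or slashes, and that
# this sed supports the \< and \> word boundaries.
replace_first_k_portable() {
  k=$1 old=$2 new=$3
  if [ "$k" -le 0 ]; then
    cat              # nothing to replace: pass the input through unchanged
    return
  fi
  set --             # reuse the positional parameters as an argument list
  i=0
  while [ "$i" -lt "$k" ]; do
    set -- "$@" -e "s/\<$old\>/$new/"
    i=$((i + 1))
  done
  sed -E "$@"
}

echo 'old old old old old' | replace_first_k_portable 3 old new
# new new new old old
```

This saves typing but not work: sed still runs k separate substitutions on every line.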

File-oriented solution

We can tell sed to read the whole file in and then perform the substitutions. For example, to replace the first three occurrences of old using a BSD-style sed:

sed -E -e 'H;1h;$!d;x' -e 's/\<old\>/new/' -e 's/\<old\>/new/' -e 's/\<old\>/new/'

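The sed commands H;1h;$!d;x read the whole file in: H appends each line to the hold space, 1h overwrites the hold space with the first line (so the result does not start with an empty line), $!d discards the pattern space and starts the next cycle on every line except the last, and x swaps the accumulated text into the pattern space once the last line has been read.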

Because the above does not use any GNU extension, it should work on BSD (OS X) sed. Note, though, that the whole file ends up in the pattern space as a single buffer, so this approach requires a sed that can handle very long lines. GNU sed should be fine. Those using a non-GNU version of sed should test its ability to handle long lines.
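To check what the slurp commands actually leave in the pattern space, one can print it with sed's l command, which shows embedded newlines as \n and marks the end of the buffer with $. A minimal sketch, with a made-up helper name:

```shell
# Hypothetical helper: slurp stdin into the pattern space and display it
# unambiguously with l (-n suppresses the normal automatic print).
show_slurped() {
  sed -n 'H;1h;$!d;x;l'
}

printf 'a\nb\nc\n' | show_slurped
# a\nb\nc$
```

The three input lines arrive as one buffer, which is why a single s command can then count occurrences across what used to be separate lines.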

With GNU sed, we can again use the g4 trick described above to replace the first three occurrences. Because the slurped file now contains newlines, \x00 takes the place of \n as the placeholder:

sed -E -e 'H;1h;$!d;x; s/\<old\>/\x00/g4; s/\<old\>/new/g; s/\x00/old/g'

This approach scales well as k becomes large. It assumes, though, that the input does not already contain the character \x00. Text files normally do not contain it, and since it is impossible to put \x00 in a bash string, this is usually a safe assumption.
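The same command can be parameterized on k, which covers the case in the question (k = 50). This is a sketch rather than a polished tool: the function name replace_first_k is hypothetical, it assumes GNU sed, and it assumes the old and new words contain no regex metacharacters or slashes.

```shell
# Hypothetical helper: replace the first K whole-word matches in the whole input.
# Assumes GNU sed, input free of NUL bytes, and OLD/NEW free of regex
# metacharacters and slashes.
replace_first_k() {
  k=$1 old=$2 new=$3
  shift 3            # remaining arguments are input files; none means stdin
  sed -E -e 'H;1h;$!d;x' \
      -e "s/\<$old\>/\x00/g$((k + 1))" \
      -e "s/\<$old\>/$new/g" \
      -e "s/\x00/$old/g" "$@"
}

printf 'old old\nold old\n' | replace_first_k 3 old new
# new new
# new old
```

For the question's example, the call would look like replace_first_k 50 linux REPLACEMENT foo.txt, where REPLACEMENT stands for whatever word should be substituted.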

Related Solutions

Regex & Sed/Perl: Match word that ISN’T preceded by another word

This is easy in any language whose regular expressions support lookbehind. Perl is the obvious first choice:

perl -pe 's/(?<!John\W)Smith/John/g' <<< "John Smith and Jane Johnson talk about Smith's car."

The weak point is input with more than one non-word character between “John” and “Smith”. Unfortunately, adding a quantifier such as + to \W raises a “Variable length lookbehind not implemented” error.

Replace all letters in a word with ‘*’ after a certain word in a text file

You could do it one at a time in a loop:

sed -e :1 -e 's/\(Word: *[^ ]*\)[^ *]/\1*/;t1'
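As a quick illustration of the loop, each pass replaces the last not-yet-masked character of the word that follows "Word:" and t1 jumps back to the :1 label until no substitution succeeds. The wrapper name mask_after_word and the sample input are made up for this sketch.

```shell
# Hypothetical wrapper around the loop shown above.
mask_after_word() {
  sed -e :1 -e 's/\(Word: *[^ ]*\)[^ *]/\1*/;t1'
}

echo 'Word: secret and more' | mask_after_word
# Word: ****** and more
```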

To edit the file in place and assuming all the characters to replace are single-byte, you can make both sed's stdin and stdout the file as in:

sed ... < file 1<> file

Or with GNU sed, use the -i flag as in:

sed -i ... file

Note that -i replaces the file with a new one of the same name rather than rewriting the original. With BSD sed, use -i '' instead of -i.