Text Processing – Replace First k Instances of a Word in a File with sed

awk, sed, text processing

I want to replace only the first k instances of a word.

How can I do this?

E.g., say the file foo.txt contains 100 occurrences of the word 'linux'.

I need to replace the first 50 occurrences only.

Best Answer

The first section below describes using sed to change the first k occurrences on a line. The second section extends this approach to change only the first k occurrences in a file, regardless of what line they appear on.

Line-oriented solution

With standard sed, there is a command to replace the k-th occurrence of a word on a line. If k is 3, for example:

sed 's/old/new/3'

Or, one can replace all occurrences with:

sed 's/old/new/g'

Neither of these is what you want.

GNU sed offers an extension that will change the k-th occurrence and all after that. If k is 3, for example:

sed 's/old/new/g3'
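To make the difference between the three flag forms concrete, here is a small sketch that runs each of them on the same five-word line. The input line is made up for illustration, and the last command assumes GNU sed.

```shell
# Number alone: only the 3rd occurrence on the line changes.
echo old old old old old | sed 's/old/new/3'    # prints: old old new old old

# g alone: every occurrence changes.
echo old old old old old | sed 's/old/new/g'    # prints: new new new new new

# GNU sed only: g plus a number changes the 3rd occurrence and all later ones.
echo old old old old old | sed 's/old/new/g3'   # prints: old old new new new
```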

These can be combined to do what you want. To change the first 3 occurrences:

$ echo old old old old old | sed -E 's/\<old\>/\n/g4; s/\<old\>/new/g; s/\n/old/g'
new new new old old

Here \n is a useful placeholder because sed strips the trailing newline when it reads each line, so we can be sure that a newline never occurs within a line.

Explanation:

We use three sed substitution commands:

  • s/\<old\>/\n/g4

    This is the GNU extension to replace the fourth and all subsequent occurrences of old with \n.

    The anchors \< and \> match the beginning and the end of a word. This ensures that only complete words are matched. GNU sed supports them in both basic and extended regex, so the -E option used here, which selects extended regex, is not strictly required for them.

  • s/\<old\>/new/g

    Only the first three occurrences of old remain and this replaces them all with new.

  • s/\n/old/g

    The fourth and all remaining occurrences of old were replaced with \n in the first step. This returns them back to their original state.
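The three steps above generalize to any k by computing the number for the first command from a shell variable. The following is a minimal sketch, assuming GNU sed; the function name replace_first_k_per_line and the variable k are illustrative and not part of the original answer.

```shell
# Replace the first k whole-word occurrences of "old" with "new" on each line.
# Assumes GNU sed (combined g and number flags, \n in the replacement).
replace_first_k_per_line() {
  k=$1
  # 1. Mark occurrence k+1 and every later one with \n.
  # 2. Replace the occurrences that are left (the first k) with "new".
  # 3. Turn the \n marks back into "old".
  sed -E "s/\<old\>/\n/g$((k + 1)); s/\<old\>/new/g; s/\n/old/g"
}

echo old old old old old | replace_first_k_per_line 2   # prints: new new old old old
```

Double quotes are used so that the shell can expand $((k + 1)); the backslash sequences pass through to sed unchanged.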

Non-GNU solution

If GNU sed is not available and you want to change the first 3 occurrences of old to new, then use three s commands:

$ echo old old old old old | sed -E -e 's/\<old\>/new/' -e 's/\<old\>/new/' -e 's/\<old\>/new/'
new new new old old

This works well when k is a small number but scales poorly to large k.

Since some non-GNU seds do not support combining commands with semicolons, each command here is introduced with its own -e option. It may also be necessary to verify that your sed supports the word boundary symbols, \< and \>.
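Typing the same -e option k times gets tedious, so one possible workaround is to let the shell build the argument list. This is only a sketch: the function name replace_first_k_portable is hypothetical, k is assumed to be at least 1, and the sed in use is assumed to understand -E, \< and \>.

```shell
# Apply the single-occurrence substitution k times to each line, using no GNU-only flags.
replace_first_k_portable() {
  k=$1
  set --                                # reuse "$@" as the list of sed arguments
  i=0
  while [ "$i" -lt "$k" ]; do
    set -- "$@" -e 's/\<old\>/new/'     # one more "replace the next remaining occurrence"
    i=$((i + 1))
  done
  sed -E "$@"
}

echo old old old old old | replace_first_k_portable 3   # prints: new new new old old
```

This removes the typing but not the cost: sed still runs k separate substitutions on every line.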

File-oriented solution

We can tell sed to read the whole file in and then perform the substitutions. For example, to replace the first three occurrences of old using a BSD-style sed:

sed -E -e 'H;1h;$!d;x' -e 's/\<old\>/new/' -e 's/\<old\>/new/' -e 's/\<old\>/new/'

The sed commands H;1h;$!d;x read the whole file in.

Because the above does not use any GNU extension, it should work on BSD (OSX) sed. Note, though, that this approach requires a sed that can handle long lines. GNU sed should be fine. Those using a non-GNU version of sed should test its ability to handle long lines.
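To see what H;1h;$!d;x leaves in the pattern space, the l command can print it unambiguously. This is a small sketch with a made-up three-line input; the function name show_slurped is illustrative.

```shell
# Print the pattern space after the whole input has been collected.
show_slurped() {
  # H     append the current line to the hold space
  # 1h    on line 1, overwrite the hold space instead (avoids a leading newline)
  # $!d   on every line but the last, discard the pattern space and read the next line
  # x     on the last line, swap the collected text into the pattern space
  # l     print the pattern space with newlines shown as \n and the end shown as $
  sed -n 'H;1h;$!d;x;l'
}

printf 'a\nb\nc\n' | show_slurped   # prints: a\nb\nc$
```

The three input lines arrive as a single string with embedded newlines, which is why the s commands that follow can count occurrences across the whole file.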

With a GNU sed, we can further use the g trick described above, but with \n replaced with \x00, to replace the first three occurrences:

sed -E -e 'H;1h;$!d;x; s/\<old\>/\x00/g4; s/\<old\>/new/g; s/\x00/old/g'

This approach scales well as k becomes large. This assumes, though, that \x00 (the NUL character) is not in your original input. Since it is impossible to put the character \x00 in a bash string, and text files do not normally contain it, this is usually a safe assumption.
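For the case in the question (the first 50 occurrences in foo.txt), the number can again come from a shell variable. This is a minimal sketch, assuming GNU sed and input without NUL characters; the function name replace_first_k_in_file and its arguments are illustrative, and the words old and new stand in for the real ones.

```shell
# Replace the first k whole-word occurrences of "old" in a file, counting across lines.
# Assumes GNU sed and that the file contains no NUL characters.
replace_first_k_in_file() {
  k=$1
  file=$2
  # The first script is single-quoted so the shell does not expand $! in it.
  # The second is double-quoted so the shell can compute k+1.
  sed -E -e 'H;1h;$!d;x' \
         -e "s/\<old\>/\x00/g$((k + 1)); s/\<old\>/new/g; s/\x00/old/g" \
         "$file"
}

# Example: replace_first_k_in_file 50 foo.txt > foo.new.txt
```

The result goes to standard output, so the original file is left untouched unless the output is redirected and moved into place.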
