Non-line-oriented tool for string replacement


I recently asked a question about how to remove the newline character if it occurs after another specific character.

Unix text-processing tools are very powerful, but almost all of them work on lines of text, which is fine as long as each line fits in the available memory.

But what should I do if I wish to replace a text sequence in a huge file that doesn't contain any newlines?

For instance, how could I replace <foobar> with \n<foobar> without reading the input line by line, given that there is only one line and it is 2.5G characters long?
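
To make the goal concrete, here is what the replacement should do on a toy input (the sample text and file name below are made up purely for illustration; the real file is a single line of about 2.5G characters):

    # a toy stand-in for the real file (illustration only)
    printf 'blah <foobar>blah <foobar>blah' > sample
    # after replacing every <foobar> with \n<foobar>, the desired content is:
    # blah 
    # <foobar>blah 
    # <foobar>blah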

Best Answer

The first thing that occurs to me when facing this type of problem is to change the record separator. In most tools, this is set to \n by default but that can be changed. For example:

  1. Perl

    perl -0x3E -pe 's/<foobar>/\n$&/' file
    

    Explanation

    • -0 : this sets the input record separator to a character given its hexadecimal value. In this case, I am setting it to >, whose hex value is 3E. The general format is -0xHEX_VALUE. This is just a trick to break the line into manageable chunks.
    • -pe : print each input record after applying the script given by -e.
    • s/<foobar>/\n$&/ : a simple substitution. The $& is whatever was matched, in this case <foobar>.
  2. awk

    awk '{gsub(/foobar>/,"\n<foobar>");printf "%s",$0};' RS="<" file
    

    Explanation

    • RS="<" : set the input record separator to <.
    • gsub(/foobar>/,"\n<foobar>") : substitute all occurrences of foobar> with \n<foobar>. Because RS has been set to <, the < characters are record separators and never appear in the records themselves (that's how awk works), so we need to match foobar> (without the <) and replace it with \n<foobar>, which puts the < back.
    • printf "%s",$0 : print the current record after the substitution. $0 is the current record in awk, i.e. whatever came between two < separators; printf is used instead of print so that no extra newline is appended.
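
Before running either of these on the real file, you can sanity-check them on a tiny sample (the file name and contents below are made up for illustration):

    printf 'blah <foobar>blah <foobar>blah' > sample
    perl -0x3E -pe 's/<foobar>/\n$&/' sample
    awk '{gsub(/foobar>/,"\n<foobar>");printf "%s",$0};' RS="<" sample
    # each command should print:
    # blah 
    # <foobar>blah 
    # <foobar>blah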

I tested these on a 2.3 GB, single-line file created with these commands:

    for i in {1..900000}; do printf "blah blah <foobar>blah blah"; done > file
    for i in {1..100}; do cat file >> file1; done
    mv file1 file

Both the awk and the perl commands used negligible amounts of memory.
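
If you want to check the memory usage yourself, one way is to run each command under GNU time and look at the reported peak RSS (this assumes /usr/bin/time is the GNU implementation; the -v option is not available in the shell built-in or the BSD version):

    /usr/bin/time -v perl -0x3E -pe 's/<foobar>/\n$&/' file > /dev/null
    /usr/bin/time -v awk '{gsub(/foobar>/,"\n<foobar>");printf "%s",$0};' RS="<" file > /dev/null
    # the "Maximum resident set size (kbytes)" line shows the peak memory of each run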
