Non-line-oriented tool for string replacement

text processing

I recently asked a question about how to remove the newline character if it occurs after another specific character.

Unix text-processing tools are very powerful, but almost all of them deal with lines of text, which is fine most of the time when the input fits in the available memory.

But what should I do if I wish to replace a text sequence in a huge file that doesn't contain any newlines?

For instance, how can I replace <foobar> with \n<foobar> without reading the input line by line, given that there is only one line and it is 2.5G characters long?

Best Answer

The first thing that occurs to me when facing this type of problem is to change the record separator. In most tools, this is set to \n by default but that can be changed. For example:

Perl
```
perl -0x3E -pe 's/<foobar>/\n$&/' file
```
Explanation
- -0 : this sets the input record separator to a character given by its octal or, with the x prefix used here, hexadecimal value. In this case, I am setting it to >, whose hex value is 3E. The general format is -0xHEX_VALUE. This is just a trick to break the single huge line into manageable chunks. Because <foobar> itself ends in >, a match is never split across two records.
- -pe : print each input record after applying the script given by -e.
- s/<foobar>/\n$&/ : a simple substitution. The $& is whatever was matched, in this case <foobar>.
awk
```
awk '{gsub(/foobar>/,"\n<foobar>");printf "%s",$0};' RS="<" file
```
Explanation
- RS="<" : set the input record separator to <.
- gsub(/foobar>/,"\n<foobar>") : substitute all cases of foobar> with \n<foobar>. Note that because RS has been set to <, all < are removed from the input file (that's how awk works) so we need to match foobar> (without a <) and replace with \n<foobar>.
- printf "%s",$0 : print the current "line" after the substitution. $0 is the current record in awk so it will hold whatever was before the <.
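One consequence worth checking on a small input: any < that does not belong to <foobar> is also consumed as a separator and never printed again. The sketch below compares the command above with one possible variant that writes the < back before every record after the first. The sample string and variable names are hypothetical, -v RS='<' is used instead of the trailing RS="<" assignment, and input that ends in a bare < is not handled.

```shell
# Hypothetical input with a second, unrelated tag.
sample='a <foobar>b <other>c'

# The gsub approach from the answer: the "<" of <other> is dropped.
lossy=$(printf '%s' "$sample" |
    awk -v RS='<' '{ gsub(/foobar>/, "\n<foobar>"); printf "%s", $0 }')

# Variant (an assumption, not the answer's command): every record after the
# first was preceded by "<", so print it back, with a newline in front only
# when the record starts with "foobar>".
kept=$(printf '%s' "$sample" |
    awk -v RS='<' 'NR > 1 { printf "%s", (/^foobar>/ ? "\n<" : "<") }
                   { printf "%s", $0 }')

printf '%s\n---\n%s\n' "$lossy" "$kept"
```

For the test file used here, which contains no other < characters, both commands should give the same output.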

I tested these on a 2.3 GB, single-line file created with these commands:

```
for i in {1..900000}; do printf "blah blah <foobar>blah blah"; done > file
for i in {1..100}; do cat file >> file1; done
mv file1 file
```

Both the awk and the perl used negligible amounts of memory.

Explanation

- -i.bak : edit the file in place, and create a backup of the original called file.bak. If you don't want a backup, just use perl -i -pe instead.
- -pe : read the input file line by line and print each line after applying the script given by -e.
- s/>\n/>/ : the substitution, just like in sed. It removes a newline that directly follows a >.

And here's an awk approach:

```
awk '{if(/>$/){printf "%s",$0}else{print}}' file2
```

String replacement in file

With a recent (for \K and s///r) perl and assuming your <string> tags don't nest:

```
perl -0777 -pi.bak -e's{<string.*?>\K.*?(?=</string>)}{$&=~s/-/&#8211;/rg}ges' file.xml
```

- -0777 : slurp mode: handle the whole file at once (to allow <string> tags to span several lines). This means the whole file has to fit in memory.
- -p : sed mode: print the input after applying the script to it.
- -i.bak : in-place editing with .bak extension (BTW, that's where some sed implementations got that idea from).
- s{...}{...}ges : substitute globally (g), where . matches newline characters as well (s), and treat the replacement as perl code to execute (e).
- <string.*?>\K.*?(?=</string>) : match from <string...> to </string> but don't include the tags themselves in the part that is matched (\K defines where the matched portion starts, and (?=...) is a look-ahead operator that only checks if </string> is there, but doesn't include it in the match).
- $&=~s/.../.../rg : do the substitution on the matched part ($&). The r flag makes it return the substituted string instead of modifying $&.
