I recently asked a question about how to remove the newline character if it occurs after another specific character.
Unix text-processing tools are very powerful, but almost all of them deal with lines of text, which is fine most of the time when the input fits in the available memory.
But what should I do if I wish to replace a text sequence in a huge file that doesn't contain any newlines?
For instance, replace <foobar> with \n<foobar> without reading the input line by line (there is only one line, and it is 2.5G characters long)?
Best Answer
The first thing that occurs to me when facing this type of problem is to change the record separator. In most tools, this is set to \n by default, but that can be changed. For example:
Perl
Explanation
-0: this sets the input record separator to a character given its hexadecimal value. In this case, I am setting it to >, whose hex value is 3E. The general format is -0xHEX_VALUE. This is just a trick to break the line into manageable chunks.
-pe: print each input line after applying the script given by -e.
s/<foobar>/\n$&/: a simple substitution. The $& is whatever was matched, in this case <foobar>.
awk
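The awk command is also missing here. Pieced together from the RS, gsub and printf fragments quoted in the explanation, it would look roughly like the first command in this sketch. The second command is my own variant, not part of the original answer, that puts back the < characters the first one drops. File names are hypothetical:

```shell
# Tiny single-line stand-in for the real input (hypothetical file name).
printf 'aaa<b>bbb<foobar>ccc<foobar>ddd' > awk_sample.txt

# Reconstructed command: split records on "<", re-add it only for foobar>.
awk '{gsub(/foobar>/,"\n<foobar>"); printf "%s",$0}' RS="<" awk_sample.txt > awk_out.txt

# Variant (an assumption, not from the answer): every record after the first
# was preceded by a "<", so print it back, with a newline before foobar>.
# A "<" that is the very last character of the file is still lost.
awk 'NR > 1 { printf "%s", ($0 ~ /^foobar>/ ? "\n<" : "<") }
     { printf "%s", $0 }' RS="<" awk_sample.txt > awk_keep_out.txt
```

On this sample, the first command turns <b> into b>, while the variant leaves it intact. The first form is fine when < only ever appears as part of <foobar>.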
Explanation
RS="<": set the input record separator to <.
gsub(/foobar>/,"\n<foobar>"): substitute all cases of foobar> with \n<foobar>. Note that because RS has been set to <, all < are removed from the input file (that's how awk works), so we need to match foobar> (without a <) and replace with \n<foobar>.
printf "%s",$0: print the current "line" after the substitution. $0 is the current record in awk, so it will hold whatever was before the <.
I tested these on a 2.3 GB, single-line file created with these commands:
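The commands used to build that file are not preserved here. As a hypothetical stand-in (not the author's original commands), a loop like the following writes a single-line file with no newlines. COUNT, the chunk text and big_sample.txt are all assumptions chosen for illustration:

```shell
# Number of 28-byte chunks to write; raise this to approach a multi-GB file.
COUNT=1000

i=0
while [ "$i" -lt "$COUNT" ]; do
    # No trailing newline, so the whole file stays one long line.
    printf 'blah blah <foobar>blah blah '
    i=$((i + 1))
done > big_sample.txt
```

A shell loop is slow for very large counts, but it streams its output and never holds the whole file in memory.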
Both the awk and the perl commands used negligible amounts of memory.