Sed – Replace String Containing Newline in Huge File

newlines, sed, text processing

Anyone know of a non-line-based tool to "binary" search/replace strings in a somewhat memory-efficient way? See this question too.

I have a 2 GB+ text file that I would like to process in a way similar to what this appears to do:

sed -e 's/>\n/>/g'

That is, I want to remove all newlines that occur after a >, but not anywhere else, which rules out tr -d.
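
For illustration, with some made-up sample data, input like this:

<entry>
some text
more text>
and the rest

should come out as:

<entry>some text
more text>and the rest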

This command (which I got from the answer to a similar question) fails with couldn't re-allocate memory, presumably because the :a;N;$!ba loop accumulates the entire file in sed's pattern space:

sed --unbuffered ':a;N;$!ba;s/>\n/>/g'

So, are there any other methods without resorting to C?
I hate perl, but am willing to make an exception in this case 🙂

I don't know for sure of any character that does not occur in the data, so temporary replacing \n with another character is something I'd like to avoid if possible.

Any good ideas, anyone?

Best Answer

This really is trivial in Perl; you shouldn't hate it!

perl -i.bak -pe 's/>\n/>/' file

Explanation

  • -i.bak : edit the file in place, and create a backup of the original called file.bak. If you don't want a backup, just use perl -i -pe instead.
  • -pe : read the input file line by line and print each line after applying the script given as -e.
  • s/>\n/>/ : the substitution, just like sed.
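
As a quick sanity check on a small throwaway file (the file name and contents here are just made-up examples):

printf '<a>\nfoo\nbar>\nbaz\n' > sample.txt
perl -i.bak -pe 's/>\n/>/' sample.txt
cat sample.txt
# <a>foo
# bar>baz

The original contents are preserved in sample.txt.bak.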

And here's an awk approach:

awk '{if(/>$/){printf "%s",$0}else{print}}' file2
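
Here, any line ending in > is printed with printf and no trailing newline, so the following line is joined onto it; all other lines are printed as-is. Unlike the perl -i command above, awk writes to standard output, so you would redirect the result to a new file (the output name below is just a placeholder):

awk '{if(/>$/){printf "%s",$0}else{print}}' file2 > file2.joined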