Sed – Replace String Containing Newline in Huge File

newlines, sed, text processing

Anyone know of a non-line-based tool to "binary" search/replace strings in a somewhat memory-efficient way? See this question too.

I have a 2 GB+ text file that I would like to process in a way similar to what this appears to do:

sed -e 's/>\n/>/g'

That is, I want to remove all newlines that occur after a >, but not anywhere else, which rules out tr -d.
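
For illustration, with some made-up sample data, input like this:

<entry>
some text
more text>
and the rest

should come out as:

<entry>some text
more text>and the rest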

This command (which I got from the answer to a similar question) fails with couldn't re-allocate memory, presumably because the :a;N;$!ba loop accumulates the entire file in sed's pattern space:

sed --unbuffered ':a;N;$!ba;s/>\n/>/g'

So, are there any other methods without resorting to C?
I hate perl, but am willing to make an exception in this case 🙂

I don't know for sure of any character that does not occur in the data, so temporary replacing \n with another character is something I'd like to avoid if possible.

Any good ideas, anyone?

Best Answer

This really is trivial in Perl; you shouldn't hate it!

perl -i.bak -pe 's/>\n/>/' file

Explanation

  • -i.bak : edit the file in place, and create a backup of the original called file.bak. If you don't want a backup, just use perl -i -pe instead.
  • -pe : read the input file line by line and print each line after applying the script given as -e.
  • s/>\n/>/ : the substitution, just like sed.
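
As a quick sanity check on a small throwaway file (the file name and contents here are just made-up examples):

printf '<a>\nfoo\nbar>\nbaz\n' > sample.txt
perl -i.bak -pe 's/>\n/>/' sample.txt
cat sample.txt
# <a>foo
# bar>baz

The original contents are preserved in sample.txt.bak.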

And here's an awk approach:

awk '{if(/>$/){printf "%s",$0}else{print}}' file2
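
Here, any line ending in > is printed with printf and no trailing newline, so the following line is joined onto it; all other lines are printed as-is. Unlike the perl -i command above, awk writes to standard output, so you would redirect the result to a new file (the output name below is just a placeholder):

awk '{if(/>$/){printf "%s",$0}else{print}}' file2 > file2.joined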