Substituting strings in a very large file

sed, text-processing

I have a very long series of URLs with no separating character, in the following format:

http://example.comhttp://example.nethttp://example.orghttp://etc...

I want each URL to be on a new line. I tried to do this by replacing all instances of "http://" with "\nhttp://" using sed:

sed 's_http://_\nhttp://_g' urls.txt

but a segmentation fault occurs (memory violation). I can only surmise that the sheer size of the file (it's over 100GB) is causing sed to exceed some limit.

I could split the file into several smaller files for processing, but all instances of "http://" would need to be kept intact.

Is there a better way to do this?

Best Answer

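sed reads its input line by line, and this file is effectively one enormous line. With awk you can avoid reading a huge amount of text at once by making "http://" the record separator, so that each URL is read as its own record: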

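# Skip the empty record before the first "http://" and re-add the prefix to each URL
awk -v RS='http://' '{ sub(/\n$/, "") } length($0) { print "http://" $0 }' urls.txt > urlsperline.txt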

Whether this works may depend on the awk implementation, since POSIX leaves the behavior of a multi-character RS unspecified. For example, gawk works fine, but mawk crashes.
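Before committing to a 100GB run, it may be worth checking how the local awk behaves on a tiny sample. The following is only a sketch built on the same record-separator idea: the function name split_urls and the sample data are made up for illustration, and it assumes an awk that accepts a multi-character RS (gawk does). It prints the "http://" prefix explicitly and skips empty records, so the output has no blank first line and no dangling "http://" at the end.

```shell
# Hypothetical helper: read concatenated URLs on stdin, write one URL per line.
# Assumption: the awk in use accepts a multi-character RS (e.g. gawk).
split_urls() {
    awk -v RS='http://' '
        { sub(/\n$/, "") }                  # drop the final newline of the input, if any
        length($0) { print "http://" $0 }   # skip empty records, restore the prefix
    '
}

# Tiny made-up sample in the same format as urls.txt
printf 'http://example.comhttp://example.nethttp://example.org\n' | split_urls
```

If the sample comes out as one URL per line, the same function can be fed the real file with split_urls < urls.txt > urlsperline.txt. Note that a URL which itself contains "http://" (for instance inside a query string) would also be split at that point.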
