Replace string in a huge (70GB), one line, text file

large files, sed, text processing

I have a huge (70GB), one-line text file and I want to replace a string (token) in it.
I want to replace the token <unk> with another dummy token (a GloVe issue).

I tried sed:

sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

but the output file corpus.txt.new is zero bytes!

I also tried using perl:

perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

but I got an out of memory error.

For smaller files, both of the above commands work.

How can I replace a string in such a file?
This is a related question, but none of the answers worked for me.

Edit:
What about splitting the file into chunks of 10GB (or whatever) each, applying sed to each one, and then merging them with cat? Does that make sense? Is there a more elegant solution?

Best Answer

The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).

If there's an ASCII character that appears frequently in the file and doesn't appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn't care about any record size. Supposing that ; works:

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new
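Before running this on the real corpus, it may help to try it on a throwaway file. The following is only a minimal sketch: the file name sample.txt and its contents are made up, and the sample deliberately ends with ; so that sed only ever sees complete lines.

```shell
# Hypothetical stand-in for corpus.txt: a single "line" with no newline,
# ending in the chosen separator ';'.
tmp=$(mktemp -d)
printf 'the <unk> cat; sat on <unk>; the mat;' >"$tmp/sample.txt"

# Swap ';' and newline so sed sees many short records, substitute,
# then swap back to restore the original layout.
<"$tmp/sample.txt" tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >"$tmp/sample.txt.new"

# Show the result (echo only adds a newline for readability on the terminal).
cat "$tmp/sample.txt.new"; echo
```

If the sample comes out as expected, the same three commands should carry over to the large file unchanged.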

You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new
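The spurious match at the start of the file can be reproduced on a small sample. This is a sketch under assumptions: the sample text is invented, the 2,$ address is written out with the same substitution as above (my reading of the abbreviated command), and it relies on a sed that leaves an unterminated last line alone, such as GNU sed.

```shell
# Hypothetical one-line sample that starts with "unk>" (no leading "<").
tmp=$(mktemp -d)
printf 'unk> a <unk> b' >"$tmp/sample.txt"

# Unguarded: the first record also starts with "unk>", so it is rewritten
# even though it was never preceded by "<".
unguarded=$(tr '\n<' '<\n' <"$tmp/sample.txt" |
  sed 's/^unk>/raw_unk>/' |
  tr '\n<' '<\n')

# Guarded: only records 2..last are touched; each of those followed a "<".
guarded=$(tr '\n<' '<\n' <"$tmp/sample.txt" |
  sed '2,$ s/^unk>/raw_unk>/' |
  tr '\n<' '<\n')

printf 'unguarded: %s\nguarded:   %s\n' "$unguarded" "$guarded"
```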

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.
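One way to try the last-character idea is to read the final byte of the file with tail -c 1 and use it as the separator. This is a rough sketch, not a general tool: the sample file is invented, and it assumes the last byte is an ordinary character (not a newline, not something tr treats specially such as a backslash or a hyphen, and not part of <unk> or <raw_unk>).

```shell
# Hypothetical one-line sample whose last character is '.'.
tmp=$(mktemp -d)
printf 'a <unk> b. c <unk> d.' >"$tmp/sample.txt"

# tail -c 1 prints only the final byte of the file.
sep=$(tail -c 1 "$tmp/sample.txt")

# Assumption: $sep is a plain character that is safe to hand to tr.
# After the first swap the data ends with a newline, so sed never
# sees a partial last line.
tr "\n$sep" "$sep\n" <"$tmp/sample.txt" |
sed 's/<unk>/<raw_unk>/g' |
tr "\n$sep" "$sep\n" >"$tmp/sample.txt.new"

cat "$tmp/sample.txt.new"; echo
```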

Related Solutions

Shell – Find and replace with contents of a file

Here's a simple recursive version with awk. Create a script in your PATH with the following contents:

#!/bin/bash
awk '
$1=="include" && NF>=2 {
   system("'$0' " $2)
   next
}
{print}' "$@"

It assumes filenames have no special characters (including spaces) in them. The awk program checks whether the first word is include, then calls the script again to process the file named by the second word. All other lines are printed. Note that the $0 here is outside the single quotes of the awk program, so it is the shell's $0, i.e. the script name.
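A quick way to see the recursion is to run the script on two small files. In this sketch the script name expand-includes and the files main.txt and part.txt are all hypothetical, and the script is invoked by relative path instead of through PATH; it also assumes /bin/bash exists and that the temporary directory allows executing scripts.

```shell
tmp=$(mktemp -d)

# The script from above, saved under a hypothetical name. The quoted 'EOF'
# keeps $0, $2 and "$@" literal while the file is being written.
cat >"$tmp/expand-includes" <<'EOF'
#!/bin/bash
awk '
$1=="include" && NF>=2 {
   system("'$0' " $2)
   next
}
{print}' "$@"
EOF
chmod +x "$tmp/expand-includes"

# main.txt pulls in part.txt through an "include" line.
printf 'header\ninclude part.txt\nfooter\n' >"$tmp/main.txt"
printf 'included line\n' >"$tmp/part.txt"

# Run from inside $tmp so the relative file name part.txt resolves.
result=$(cd "$tmp" && ./expand-includes main.txt)
printf '%s\n' "$result"
```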

How to use sed or ex to replace a block (multi-line code) with new block of text (code)

I suggest using the change command (which is essentially a delete coupled with an append, though the append is only applied for the last line in the range which is exactly what you want here):

sed -i '/marker1/,/marker2/c\
New text 1\
New text 2' filename

This uses GNU sed's syntax for in-place editing (-i). The c command itself is otherwise standard and portable. GNU sed also supports:

sed '/marker1/,/marker2/cNew text 1\
New text 2' filename

as a non-standard extension.

Newline and backslash characters must be escaped (with backslash) in the replacement text.
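To check the behaviour before editing a real file in place, the same command can be run without -i on a small sample. This is just an illustrative sketch: block.txt and its contents are made up, and the result goes to standard output so the file is left untouched.

```shell
# Hypothetical sample: a block delimited by marker1 and marker2,
# surrounded by lines that should survive.
tmp=$(mktemp -d)
printf 'keep 1\nmarker1\nold a\nold b\nmarker2\nkeep 2\n' >"$tmp/block.txt"

# Without -i, sed prints the edited text and does not modify block.txt.
# The whole marker1..marker2 range is replaced by the two new lines.
result=$(sed '/marker1/,/marker2/c\
New text 1\
New text 2' "$tmp/block.txt")

printf '%s\n' "$result"
```

Note that the marker lines themselves are part of the range and are replaced as well.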
