Text Processing – Remove First 300 Million Lines from a 700 GB txt File

awk, files, head, sed, tail

How do I remove the first 300 million lines from a 700 GB text file
on a system with 1 TB disk space total, with 300 GB available? 
(My system has 2 GB of memory.) 
The answers I found use sed, tail, or head.
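They typically look something like this (illustrative only; the +300000001 offset makes tail start at line 300,000,001, i.e. it skips the first 300 million lines):

tail -n +300000001 file > newFile
sed '1,300000000d' file > newFile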

But I think (please correct me if I'm wrong) I cannot use them, because the disk space is limited to 1 TB and they produce a new file and/or keep a temporary file during processing.

The file contains database records in JSON format.

Best Answer

If you have enough space to compress the file, doing so should free a significant amount of space and let you carry out further operations. You can try this:

gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz

That will first gzip the original input file (file) to create file.gz; gzip removes the uncompressed original after successful compression, which is what frees the space. Then you zcat the newly created file.gz, pipe it through tail -n +300000001 to remove the first 300 million lines, compress the result to save disk space, and save it as newFile.gz. The && ensures that you only continue if the gzip operation was successful (it will fail if you run out of space).
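Once newFile.gz exists and you have checked it, you could then remove the intermediate compressed original and, if you need the trimmed data back as plain text and the uncompressed result fits in the freed space, expand it again. A minimal sketch, assuming you no longer need file.gz:

rm file.gz          # drop the compressed original to reclaim space
gunzip newFile.gz   # optional: recreate a plain-text newFile, only if it fits on disk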

Note that text files are very compressible. For example, I created a test file using seq 400000000 > file, which prints the numbers from 1 to 400,000,000; this resulted in a 3.7G file. When I compressed it using the commands above, the compressed file was only 849M and the newFile.gz I created was only 213M.
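If you want to reproduce that test, the full sequence looks roughly like this (sizes are approximate and will vary with the gzip version and filesystem):

seq 400000000 > file                                    # ~3.7G of plain text
gzip file                                               # ~849M file.gz; removes file
zcat file.gz | tail -n +300000001 | gzip > newFile.gz   # ~213M
ls -lh file.gz newFile.gz                               # check the resulting sizes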
