How do I remove the first 300 million lines from a 700 GB text file
on a system with 1 TB disk space total, with 300 GB available?
(My system has 2 GB of memory.)
The answers I found use `sed`, `tail`, and `head`:
- How do I delete the first n lines of a text file using shell commands?
- Remove the first n lines of a large text file
But I think (please correct me if I'm wrong) that I cannot use them, because the disk space is limited to 1 TB and they produce a new file and/or keep a temporary file during processing.
The file contains database records in JSON format.
Best Answer
If you have enough space to compress the file, which should free a significant amount of space, allowing you to do other operations, you can try this:
That will first `gzip` the original input file (`file`) to create `file.gz`. Then, you `zcat` the newly created `file.gz`, pipe it through `tail -n +300000001` to remove the first 300 million lines, compress the result to save disk space, and save it as `newFile.gz`. The `&&` ensures that you only continue if the `gzip` operation was successful (it will fail if you run out of space).

Note that text files are very compressible. For example, I created a test file using `seq 400000000 > file`, which prints the numbers from 1 to 400,000,000; this resulted in a 3.7G file. When I compressed it using the commands above, the compressed file was only 849M, and the `newFile.gz` I created was only 213M.